Paper of the week: Spark Streaming
Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters
This week's paper comes from Berkeley and the creators of Spark, a MapReduce-like cluster computing framework optimized for iterative jobs. To handle real-time big data, they extended Spark with a new programming model, discretized streams (D-Streams), implemented in a system called Spark Streaming.
Many applications that deal with "big data" also need to process it the moment it arrives, i.e. data is often most valuable when it is received. Identifying trending topics on Twitter or processing server log files in real time to detect a failure as fast as possible are only a couple of examples.
Apart from enabling real-time big data processing, D-Streams also aim to solve three of the most important challenges of stream processing systems: fault tolerance, consistency, and easy integration with batch processing. The system is built on the idea of treating stream computation as a series of deterministic batch computations on small time intervals. This approach directly inherits the advantages of batch processing and also means that all existing Spark operations can be reused in Spark Streaming. However, to minimize latency, intermediate state is kept in memory rather than on disk as in batch systems. D-Streams also borrow from Spark the idea of RDDs (Resilient Distributed Datasets), a storage abstraction that can rebuild lost data without relying on replication. Spark Streaming extends this idea to support parallel recovery in case of failures.
D-Streams
During an interval, all input data received is stored in the cluster and forms the input dataset for that particular interval. When the interval completes, the data can be processed using common operations like map and reduce. Results and intermediate state are stored as RDDs. A D-Stream is a sequence of such RDDs.
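To make the abstraction concrete, here is a toy model of it in plain Scala: each "RDD" is just an immutable in-memory batch of records, and a D-Stream is the ordered sequence of per-interval batches. This is an illustrative sketch of the idea only, not Spark's API; real RDDs are partitioned, distributed, and lazily evaluated.

```scala
object DStreamModel {
  type RDD[A] = Vector[A]            // stand-in for one interval's dataset
  type DStream[A] = Vector[RDD[A]]   // one batch per time interval

  // Split a timestamped input stream into per-interval batches.
  def discretize[A](events: Seq[(Long, A)], intervalMs: Long): DStream[A] =
    events.groupBy { case (ts, _) => ts / intervalMs }
      .toVector
      .sortBy(_._1)                                   // order batches by interval
      .map { case (_, batch) => batch.map(_._2).toVector }

  def main(args: Array[String]): Unit = {
    val events = Seq((0L, "a"), (300L, "b"), (1200L, "c"))
    // 1-second intervals: "a" and "b" land in the first batch, "c" in the second
    println(discretize(events, 1000L))
  }
}
```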
Example
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
In this example, we group the page-view events into a D-Stream of 1-second intervals. The dataset for each interval is transformed into a set of (URL, 1) tuples, which then feeds a running count.
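The semantics of that running count can be simulated over in-memory per-interval batches, without a live Spark Streaming job. In this sketch, `runningCounts` emulates `runningReduce` by folding each interval's (url, 1) pairs into a cumulative map; the hard-coded `pageViews` data stands in for the real input stream.

```scala
object RunningCount {
  // Cumulative (url, count) state after each interval, emulating runningReduce.
  def runningCounts(pageViews: Vector[Vector[String]]): Vector[Map[String, Int]] = {
    val ones = pageViews.map(_.map(url => (url, 1)))   // the "ones" D-Stream
    ones.scanLeft(Map.empty[String, Int]) { (acc, batch) =>
      batch.foldLeft(acc) { case (m, (url, n)) =>
        m.updated(url, m.getOrElse(url, 0) + n)
      }
    }.tail                                             // drop the initial empty state
  }

  def main(args: Array[String]): Unit = {
    val pageViews = Vector(Vector("/a", "/b", "/a"), Vector("/b"))  // one batch per second
    runningCounts(pageViews).foreach(println)
    // Map(/a -> 2, /b -> 1)
    // Map(/a -> 2, /b -> 2)
  }
}
```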
Operators
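Besides the stateless operators reused from Spark (map, reduce, groupBy and the like), the paper adds stateful operators that span multiple intervals, such as sliding windows. A windowed count can be sketched over the same toy per-interval batches; note that `window` below is my own illustrative helper, not the Spark Streaming API.

```scala
object Windowing {
  type DStream[A] = Vector[Vector[A]]   // one batch per interval, as above

  // For each interval, union that batch with the (n - 1) preceding ones,
  // mimicking a stateful sliding-window operator over a D-Stream.
  def window[A](ds: DStream[A], n: Int): DStream[A] =
    ds.indices.toVector.map { i =>
      ds.slice((i - n + 1).max(0), i + 1).flatten
    }

  def main(args: Array[String]): Unit = {
    val views: DStream[String] = Vector(Vector("/a"), Vector("/b"), Vector("/a"))
    // page-view counts over a 2-interval sliding window
    val counted = window(views, 2).map(_.groupBy(identity).map { case (url, v) => (url, v.size) })
    counted.foreach(println)
  }
}
```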
Notes
I found this paper cited by Muppet, another system for processing streaming data. I enjoyed reading Muppet a lot and found most of its design decisions very interesting. It could have been a very good candidate for this week's "paper of the week" post, but I was disappointed by the lack of an evaluation section or source code. On the other hand, it provided an excellent list of references to look into, one of which is the paper discussed in this post.
Links:
Happy coding and reading,
V.

