Paper of the week: Spark Streaming

Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters

This week's paper comes from Berkeley and the creators of Spark, a MapReduce-like cluster computing framework optimized for iterative jobs. To deal with real-time big data, they extended Spark with a new programming model, discretized streams (D-Streams), implemented in Spark Streaming.

A lot of applications that need to deal with "big data" also need to process it at the moment it arrives, i.e. data is often most valuable when it is received. Identifying trending topics on Twitter or processing server log files in real time to detect a failure as fast as possible are only a couple of examples.

Apart from enabling real-time big data processing, D-Streams also aim to address three of the most important challenges of stream processing systems: fault tolerance, consistency and easy integration with batch processing. The system is built on the idea of treating stream computation as a series of deterministic batch computations on small time intervals. This approach directly benefits from the advantages of batch processing and also means that all existing Spark operations can be reused in Spark Streaming. However, in order to minimize latency, intermediate state is kept in memory rather than on disk as in batch systems. D-Streams also borrow from Spark the idea of RDDs (Resilient Distributed Datasets), a storage abstraction that can be used to rebuild lost data without relying on replication. Spark Streaming extends this idea to support parallel recovery in case of failures.


D-Streams

During an interval, all input data received is stored in the cluster and forms the input dataset for that particular interval. When the interval completes, the data can be processed using common operations like map and reduce. Results and intermediate state are stored as RDDs. A D-Stream is a sequence of such RDDs.


Example

pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)

In this example, we group the page view data into D-Streams of 1-second intervals. Each interval's dataset is then transformed into a set of (URL, 1) tuples, which feed a running count.
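The micro-batch semantics of this pipeline can be imitated with plain collections, no cluster needed. Below is a rough sketch in Python (the paper's examples use a Scala-like API), where each inner list stands in for the RDD of one 1-second interval; the function names are illustrative, not Spark's API.

```python
from collections import Counter

def counts_per_interval(batch):
    """Aggregate (url, 1) tuples for a single interval, like a per-batch reduce."""
    counts = Counter()
    for url, n in batch:
        counts[url] += n
    return counts

def running_counts(intervals):
    """Imitate runningReduce: fold each interval's counts into cumulative
    state, emitting the accumulated counts after every interval."""
    state = Counter()
    out = []
    for batch in intervals:
        state += counts_per_interval(batch)
        out.append(dict(state))
    return out

# Two simulated 1-second intervals of page views.
stream = [[("a.com", 1), ("b.com", 1)], [("a.com", 1)]]
print(running_counts(stream))  # [{'a.com': 1, 'b.com': 1}, {'a.com': 2, 'b.com': 1}]
```

The key point the sketch captures is that state is just another dataset carried from one interval's batch computation to the next.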


Operators

D-Streams provide two types of operators. Transformation operators produce a new D-Stream when applied to a stream. Stateless transformation operators act independently on each interval, while stateful operators can operate over multiple intervals and may produce intermediate RDDs as a result. Aggregation over a sliding window is an example of a stateful operator. Output operators allow the program to write data to external systems, like HDFS. In addition to Spark's operators, Spark Streaming also offers windowing, incremental aggregation and time-skew join operators.
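A sliding-window count over micro-batches can be sketched in the same collections-based style, with window_size measured in intervals; this loosely mirrors windowed aggregation operators, and the names here are again illustrative.

```python
from collections import Counter

def windowed_counts(intervals, window_size):
    """Stateful-operator sketch: for each interval, aggregate over the last
    `window_size` intervals (naively recomputing the whole window)."""
    results = []
    for i in range(len(intervals)):
        lo = max(0, i - window_size + 1)
        window = Counter()
        for batch in intervals[lo:i + 1]:
            for key, n in batch:
                window[key] += n
        results.append(dict(window))
    return results
```

Note that this sketch recomputes each window from scratch; the incremental aggregation the paper describes would instead add the newly arrived interval and subtract the one sliding out of the window.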


Fault Tolerance

One of the biggest challenges of stream processing is fault tolerance and recovery. Since data is mostly valuable at the time of its arrival and many applications do not have strong accuracy requirements, some systems simply allow data loss and resume execution from a later point. Other approaches rely on replication or upstream backup, both expensive solutions, either in capacity or in latency. Spark Streaming instead introduces the idea of parallel recovery. RDDs are periodically checkpointed and asynchronously replicated, and they track their lineage, i.e. the set of deterministic operations from which they can be rebuilt. Exploiting this property, when a node in Spark Streaming fails, missing RDD partitions are detected and parallel tasks are launched to recompute them.
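The lineage idea can be modeled in a few lines: a partition records the deterministic operation and the parent data that produced it, so a lost partition can be recomputed rather than restored from a replica. This is a toy sketch, not Spark's internals, and all names are made up.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Partition:
    """Lineage record: parent data plus the deterministic op that derives
    this partition. Rerunning the op rebuilds the partition exactly."""
    parent: list
    op: Callable[[list], list]

    def compute(self):
        return self.op(self.parent)

def recover(lineage, cached):
    """Rebuild missing partitions from lineage; surviving partitions come
    from cache. In Spark Streaming these recomputations are launched as
    parallel tasks across the cluster."""
    return [cached[i] if i in cached else p.compute()
            for i, p in enumerate(lineage)]
```

Determinism is what makes this safe: recomputing a partition always yields the same data, so recovery cannot diverge from the original execution.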

Notes

I found this paper cited by Muppet, another processing system for streaming data. I enjoyed reading Muppet a lot and found most of its design decisions very interesting. It could have been a very good candidate for this week's "paper of the week" post, but I was disappointed by its lack of an evaluation section or source code. On the other hand, it provided me with an excellent list of references to look into, one of which is the paper discussed in this post.



Happy coding and reading,

V.