Spark Streaming based applications are tracking statistics about page views in real time, train a machine learning model, or automatically detect anomalies.
The abstraction in Spark Streaming is called DStreams or discretized streams. A DStream is a sequence of data which arrives over time. Internally, each DStream is represented as a sequence of RDDs arriving at each time step. DStreams can be created from various input sources, such as Flume, Kafka, or HDFS.
DStreams offer two types of operations: transformations, which yield a new DStream, and outputs, which write data to an external system.
Note to Python devs: As of Spark 1.1, Spark Streaming is available only in Java and Scala. Experimental Python support was added in Spark 1.2, though it supports only text data.
Example Spark Streaming: Streaming filter for printing lines containing “error” in Scala
// Create a StreamingContext with a 1-second batch size from a SparkConf
val ssc = new StreamingContext(conf, Seconds(1))
// Create a DStream using data received after connecting
// to port 7777 on the local machine
val lines = ssc.socketTextStream("localhost",7777)
// Filter our DStream for lines with "error"
val errorLines = lines.filter(_.contains("error"))
// Print out the lines
This sets up only the computation that will be done when the system receives data. To start receiving, need to explicitly call start() on the StreamingContext.
// Start our streaming context and wait for it to "finish"
// Wait for the job to finish
Architecture and Abstraction
Spark Streaming uses a “micro-batch” architecture. This means computation is a continuous series of batch operations on small batches of data. New batches are created at regular time intervals typically configured between 500 milliseconds and several seconds.
For each streaming input source, a receiving task is launched within the application’s executors. The executors receive the input data and replicate it (by default) to another executor for fault tolerance. Data is stored memory of the executors the same way as cached RDDs.
The same fault-tolerance properties for DStreams are available as Spark has for RDDs.
Transformations on DStreams are either stateless or stateful.
Stateless transformations, as the name implies, are simple RDD transformations being applied on every batch with no regard to previous or future transformations in the streaming pipeline.
Stateful transformations track data from previous batches and are used to generate new results for the batch executor.
Windowed operations compute results across a longer time period than the StreamingContext’s batch interval. Windowed operations combine results from multiple batches.
When maintaining state across the batches in a DStream such as track clicks as a user visits a site, updateStateByKey() is suggested as a state variable for DStreams of key/ value pairs.
Output operations use the final transformed data in a stream and push it to an external database or print it to the screen.
Spark Streaming has built-in support for a number of different data sources. Some “core” sources are built into the Spark Streaming while others are available through libraries.
Core Sources include steam of files and Akka actor stream.
Additional input sources include Apache Kafka and Apache Flume.
Finally, developers are able to create their own input source receivers.
Multiple Sources and Cluster Sizing
DStreams may be combined using operations like union(). Through these operators, data can combined from multiple input DStreams.
An advantage of Spark Streaming is providing fault tolerance guarantees. As long as the input data is stored reliably, Spark Streaming can always compute the correct result from it.
Checkpointing needs to be set up to provide fault tolerance in Spark Streaming. Checkpointing periodically saves data from the application to a reliable storage system, such as HDFS or Amazon S3. This saved data can be use in recovering.
Checkpointing serves two purposes: 1) limit the state that must be recomputed on failure and 2) provide fault tolerance for the driver.
Driver Fault Tolerance
If the driver program in a streaming application crashes, it can be relaunched and recover from a checkpoint. Tolerating failures of the driver node requires a special way of creating our StreamingContext via StreamingContext.getOrCreate() function.
Worker Fault Tolerance
Spark Streaming uses the same techniques as Spark for its fault tolerance of streaming worker nodes.
Receiver Fault Tolerance
Whether receivers loses any of the data during a failure depends on the nature of the source such as whether the source can resend data or not. How? This can be determined whether the receiver updates the source about data received or not.
Spark Streaming’s worker fault-tolerance guarantees exactly-once semantics for all transformations.
Spark Streaming provides a special UI page to display attributes of streaming applications are doing. This is available in a Streaming tab on the normal Spark UI.
Spark Streaming applications have specialized tuning options including batch and window sizes, level of parallelism, garbage collection and memory usage.
Featured image photo credit https://flic.kr/p/drnQ6D