Learning Spark: Lightning-Fast Big Data Analysis – Developer Deconstructed – Chapter 6

This is the sixth post in the Learning Spark book summary series[1].

Chapter 6: Advanced Spark Programming

Overview

This chapter introduces two types of shared variables: accumulators, which aggregate information, and broadcast variables, which efficiently distribute large values.  The examples use ham radio operators’ call logs as the input.

Dividing work on a per-partition basis also allows us to optimize Spark programs.  Opening a database connection or creating a random-number generator are examples of setup work we want to avoid repeating for each element, so Spark provides per-partition versions of map and foreach to reduce the cost of running these operations.

Finally, Spark provides methods for interacting with external programs such as R scripts.

Accumulators

Accumulators provide a syntax for aggregating values from all the worker nodes back to the driver program.  A common pattern for accumulators is counting events that occur during job execution.  An example using the ham radio operators’ call logs counts the number of times the log contains a blank line.

Scala and Python examples that use the SparkContext from the console are provided; in Python, the accumulator is created with:

blankLines = sc.accumulator(0)
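To make the pattern concrete, here is a Scala sketch along the lines of the chapter’s blank-line counter; the input and output paths are placeholders:

val file = sc.textFile("callsigns.txt")        // placeholder input path
val blankLines = sc.accumulator(0)             // Accumulator[Int] starting at 0
val callSigns = file.flatMap { line =>
  if (line == "") {
    blankLines += 1                            // incremented on the workers
  }
  line.split(" ")
}
callSigns.saveAsTextFile("output.txt")         // an action forces evaluation
println("Blank lines: " + blankLines.value)    // read the total back on the driver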

How do accumulators work?

As seen in the code example above, accumulators are created in the driver program by calling SparkContext.accumulator(initialValue). The return type is an org.apache.spark.Accumulator[T] object, where T is the type of initialValue. Worker code can add to the accumulator with the += method (or add in Java). The driver program can obtain the value of the accumulator by calling value().

Spark natively supports accumulators of type Double, Long, and Float, and developers may also write their own custom accumulators.
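As an illustrative sketch of what a custom accumulator could look like under the Spark 1.x API the book targets (MaxAccumulatorParam and its use here are not from the book), an accumulator that tracks the maximum value seen might be written as:

object MaxAccumulatorParam extends org.apache.spark.AccumulatorParam[Int] {
  def zero(initialValue: Int): Int = Int.MinValue          // identity element for max
  def addInPlace(a: Int, b: Int): Int = math.max(a, b)     // merge two partial results
}

val maxSeen = sc.accumulator(Int.MinValue)(MaxAccumulatorParam)
rdd.foreach(x => maxSeen += x)                             // rdd is assumed to be an RDD[Int]
println("Maximum: " + maxSeen.value)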

Accumulators and Fault Tolerance

Spark automatically deals with failed or slow machines by re-executing failed or slow tasks.

How does this fault tolerance potentially affect accumulators?

For accumulators used in actions, Spark applies each task’s update only once, even if the task is re-executed; accumulators updated inside transformations carry no such guarantee and may be applied more than once.  Therefore, to obtain a counter whose value is reliable in the face of node failures or retried tasks, the accumulator update must be placed inside an action like foreach().
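A minimal sketch of that pattern, assuming validSigns is an RDD[String] of call signs that have already been validated (the name is illustrative):

val validSignCount = sc.accumulator(0)
validSigns.foreach(sign => validSignCount += 1)      // foreach is an action, so each task's update is applied exactly once
println("Valid call signs: " + validSignCount.value)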

Broadcast Variables

Spark broadcast variables are used by the driver program to efficiently send a large, read-only value to all the worker nodes. For example, if your application needs to send a large, read-only lookup reference table to all the nodes, broadcasting it as a variable is a solid approach.

Broadcast Background

It’s important to remember that Spark automatically sends all variables referenced in closures to the worker nodes. While this is convenient, it can also be inefficient: the same variable may be used in multiple parallel operations, but Spark sends it separately for each operation.  Hence the value of broadcast variables.

The book shows examples of using broadcast variables with the ham radio logs; the Scala example begins:

val signPrefixes = sc.broadcast(loadCallSignTable())

where loadCallSignTable loads a lookup table mapping call sign prefixes to countries.
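A sketch of how the broadcast value might then be used on the workers, assuming contactCounts is a pair RDD of (callSign, count) and lookupCountry is a helper that resolves a call sign to a country using the prefix table (both names are illustrative):

val countryContactCounts = contactCounts.map { case (sign, count) =>
  val country = lookupCountry(sign, signPrefixes.value)   // read the broadcast value; never modify it
  (country, count)
}.reduceByKey((x, y) => x + y)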

Broadcast Optimization

When broadcasting large values, it is important to choose an appropriate data serialization format.  Java Serialization, the default serialization library used in Spark’s Scala and Java APIs, can be very inefficient for anything other than arrays of primitive types. Optimizations may be made by using an alternative serialization library such as Kryo.  This will be covered in more detail in Chapter 8.
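As a minimal sketch, Kryo can be enabled through the Spark configuration when the context is constructed (class registration and other tuning options are left out):

val conf = new org.apache.spark.SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new org.apache.spark.SparkContext(conf)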

Utilizing Partitions for Performance

As previously mentioned, working with data on a per-partition basis allows us to avoid redoing setup work for each element. Spark’s per-partition versions of map and foreach are mapPartitions(), mapPartitionsWithIndex(), and foreachPartition().
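For instance, an expensive setup object such as a random-number generator can be created once per partition rather than once per element (a sketch, assuming rdd is an RDD[String]):

import scala.util.Random

val sampled = rdd.mapPartitions { iter =>
  val rng = new Random()                    // created once per partition, not per element
  iter.filter(_ => rng.nextDouble() < 0.1)  // keep roughly 10% of the elements
}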

As with the serialization libraries mentioned for broadcast optimization, partitioning and its effect on performance is revisited in Chapter 8.

Spark and External Programs

Spark provides the pipe() method on RDDs to pipe data to programs in other languages, like R scripts.  Spark’s pipe() requires the external program to be able to read and write to Unix standard streams.
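A minimal sketch, assuming distances.R is a script that reads lines from standard input and writes one line to standard output for each input line (the script path is a placeholder):

val piped = rdd.pipe("./src/R/distances.R")   // each RDD element is written to the script's stdin as a line
piped.collect().foreach(println)              // each line of the script's stdout becomes an element of the result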

Numeric RDD Operations

Although not mentioned in the chapter overview, Spark provides several statistical operations on RDDs that contain numeric data.

Available methods include count(), mean(), sum(), max(), min(), variance(), sampleVariance(), stdev(), and sampleStdev().

These are in addition to the more complex statistical and machine learning methods that will be covered in Chapter 11.
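As a short sketch, the statistics can be requested individually or computed together in a single pass with stats(), which returns a StatCounter (the input values below are illustrative):

val nums = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))
val summary = nums.stats()                     // count, mean, stdev, max, min computed in one pass
println(summary.mean + " +/- " + summary.stdev)
println(nums.variance())                       // individual methods are also available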

Reference

The entire source code from this chapter is available at the previously mentioned GitHub link, in particular:

src/python/ChapterSixExample.py

src/main/scala/com/oreilly/learningsparkexamples/scala/ChapterSixExample.scala

src/main/java/com/oreilly/learningsparkexamples/java/ChapterSixExample.java

 

Featured image photo credit https://flic.kr/p/4EGvtE

[1] As mentioned in previous posts, I recommend purchasing Learning Spark: Lightning-Fast Big Data Analysis for a variety of reasons.  These posts could be used as a reference after you purchase the book.
