Learning Spark: Lightning-Fast Big Data Analysis – Developer Deconstructed – Chapter 7 – Running on a Cluster

This is the seventh post in the Learning Spark book summary series[1].

Chapter 7 Running on a Cluster

A key feature of Spark is its ability to run computations in parallel across many machines when running in cluster mode.  Even better, parallelized applications are written with the same API used in the previously shown examples.

Spark can run on a wide variety of cluster managers such as Hadoop YARN, Apache Mesos, and Spark’s own built-in Standalone cluster manager.

Architecture

Spark uses a master/slave architecture with one central coordinator and many distributed workers. The central coordinator is referred to as the driver.  The driver collaborates with distributed workers called executors. The driver runs in its own Java process, and each executor runs in its own Java process as well. The driver together with all of its executors makes up the Spark application.

Driver

The Spark driver is the process where the main() method of the program executes.  It is this process that creates the SparkContext, creates RDDs, and performs transformations and actions.

Launching a Spark shell creates a driver program.  Also, as previously shown, the Spark shell comes preloaded with a SparkContext called sc. A driver performs two duties: converting a user program into tasks and scheduling those tasks on executors.
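As a quick illustration, here is a minimal sketch of what a few lines typed into spark-shell exercise: the shell process itself is the driver, sc is the preloaded SparkContext, and README.md is just an assumed input file.

```scala
// Inside spark-shell, sc (the SparkContext) already exists -- the shell process is the driver.
val lines = sc.textFile("README.md")              // transformation: defines an RDD lazily, nothing runs yet
val errors = lines.filter(_.contains("ERROR"))    // another lazy transformation
errors.count()                                    // action: the driver converts the lineage into tasks and schedules them on executors
```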

Executors

As already mentioned, Spark executors are worker processes that run the tasks in a Spark job.  The terms executor and worker appear to be used interchangeably.

Cluster Manager

Spark depends on a cluster manager to launch its executors. Spark supports a variety of cluster managers, including Hadoop YARN and Apache Mesos, as well as providing its own built-in Standalone cluster manager.

No matter which cluster manager is used, Spark provides a single script for submitting programs: spark-submit.
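As a hedged sketch of how that single entry point is used (the class name, jar path, and master URL below are placeholders, not values from the book):

```
# Submit a packaged application; the --master URL selects the cluster manager
bin/spark-submit \
  --class com.example.WordCount \
  --master spark://masterhost:7077 \
  path/to/wordcount-assembly-1.0.jar
```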

Recap

The steps when a Spark application is run on a cluster:

1) Submit application using spark-submit script

2) spark-submit script launches the driver program and calls the main() method

3) The driver program contacts the cluster manager to ask for resources to launch executors

4) Cluster manager launches executors

5) The driver processes the program's RDD transformations and actions

6) Driver sends work (tasks) to executors

7) Tasks are run on executor processes

8) When the driver’s main() method exits or calls SparkContext.stop(), the executors are terminated and resources are released back to the cluster manager (see the sketch after this list)
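Putting the recap together, a minimal Scala driver program might look like the following sketch. The app name, input path, and word-count logic are illustrative assumptions, and the master URL is deliberately left for spark-submit to supply.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Steps 1-2: spark-submit has launched this driver and called main().
    val conf = new SparkConf().setAppName("WordCount")   // master is supplied by spark-submit
    val sc = new SparkContext(conf)                      // Steps 3-4: asks the cluster manager to launch executors

    // Step 5: transformations build the RDD lineage lazily.
    val lines = sc.textFile("hdfs:///data/input.txt")
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Steps 6-7: the action ships tasks to the executors and collects results on the driver.
    counts.take(10).foreach(println)

    // Step 8: terminate the executors and release resources from the cluster manager.
    sc.stop()
  }
}
```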

Examples of packaging a Spark application are provided for Python with the --py-files argument, for Scala with sbt, and for Java with Maven.
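For the Scala/sbt route, a minimal build.sbt might look roughly like the sketch below; the project name and the Spark and Scala versions are assumptions from the book's era, not values quoted from it.

```scala
// build.sbt -- minimal packaging sketch for a Spark application
name := "wordcount"
version := "1.0"
scalaVersion := "2.10.4"

// Marked "provided" because spark-submit supplies Spark's classes on the cluster at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
```

The packaged jar (or, for Python, the files listed with --py-files) is then handed to spark-submit as shown earlier.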

Amazon EC2

The Spark download includes a script, called spark-ec2, to launch clusters on Amazon EC2.  The script launches a set of EC2 nodes and installs the Standalone cluster manager on them.
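A rough, hedged sketch of its usage, with the key pair, identity file, cluster size, and cluster name all as placeholders:

```
# Launch a Standalone cluster on EC2, log in to it, and later destroy it
./spark-ec2 -k mykeypair -i mykeypair.pem -s 2 launch my-spark-cluster
./spark-ec2 -k mykeypair -i mykeypair.pem login my-spark-cluster
./spark-ec2 destroy my-spark-cluster
```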

Which Cluster Manager to Use?

When starting a new deployment and choosing a cluster manager, the authors recommend the following guidelines: start with a Standalone cluster. Standalone mode is the most straightforward to set up and provides practically all the same features as the other cluster managers. If you would like to run Spark alongside other applications, or require richer resource scheduling capabilities such as queues, both YARN and Mesos are better alternatives.
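Whichever manager is chosen, it is ultimately selected through the --master argument to spark-submit. A rough sketch, with host names, ports, and application names as placeholders (the yarn-cluster form matches the Spark 1.x era the book covers):

```
# Standalone cluster manager
bin/spark-submit --master spark://masterhost:7077 --class com.example.WordCount app.jar
# Hadoop YARN (cluster location is read from the Hadoop configuration)
bin/spark-submit --master yarn-cluster --class com.example.WordCount app.jar
# Apache Mesos
bin/spark-submit --master mesos://mesoshost:5050 --class com.example.WordCount app.jar
```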

 

Featured image photo credit https://flic.kr/p/8wFrUX

[1] As mentioned in previous posts, I recommend purchasing Learning Spark: Lightning-Fast Big Data Analysis for a variety of reasons.  These posts could be used as a reference after you purchase the book.
