Running an Apache Spark cluster on your local machine is a natural, early step towards Apache Spark proficiency. Let’s start understanding Spark cluster options by running a cluster on a local machine. We will use Spark’s built-in cluster manager, which is called “standalone” mode. This post will describe pitfalls to avoid, review how to run a Spark cluster locally, show how to deploy to a locally running Spark cluster, describe fundamental cluster concepts like Masters and Workers, and finally set the stage for more advanced cluster options.
1. Start Master from a command prompt *
From the Spark installation directory, run:
sbin/start-master.sh
You should see something like the following:
starting org.apache.spark.deploy.master.Master, logging to /Users/toddmcgrath/Development/spark-1.1.0-bin-hadoop2.4/sbin/../logs/spark-toddmcgrath-org.apache.spark.deploy.master.Master-1-todd-mcgraths-macbook-pro.local.out
Open this log file to confirm that the Master started. It should show the Master's spark:// URL, which you will need in the next step, and that the Master web UI is now available at http://localhost:8080.
The Spark Master is responsible for brokering resource requests by finding a suitable set of Workers to run Spark applications.
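The log file mentioned above should also record the spark:// URL that Workers and the REPL connect to. Here is a minimal sketch for pulling that URL out of the log, assuming the log contains a line with a spark://host:port address. The function name master_url_from_log is made up for illustration and is not part of Spark.

```shell
# Hypothetical helper (not part of Spark): print the first
# spark://host:port URL found in a Master log file.
# Assumes the log contains such a URL, e.g. in a line like
# "... Starting Spark master at spark://my-host.local:7077".
master_url_from_log() {
  grep -oE 'spark://[^[:space:]]+:[0-9]+' "$1" | head -n 1
}

# Usage, from the Spark installation directory:
#   master_url_from_log logs/spark-*-org.apache.spark.deploy.master.Master-*.out
```

If nothing is printed, the Master probably did not start cleanly, and the rest of the log file is the place to look.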
2. Start a Worker
todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077
Gotcha Warning: I tried a shortcut to starting a Spark Worker by expecting some defaults, and none of these attempts worked. I made my first screencast here: http://youtu.be/pUB620wcqm0
1. bin/spark-class org.apache.spark.deploy.worker.Worker
2. bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
3. bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
Finally, I tried the Master URL shown in the console, which worked:
toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077
Verify the Worker by viewing http://localhost:8080. You should see the Worker listed there.
Spark Workers are responsible for processing requests sent from the Spark Master.
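The gotcha above suggests the Worker has to use the same host name the Master registered under, not localhost or 127.0.0.1. Here is a minimal sketch for building that URL, assuming the Master was started with default settings and so uses the name reported by the hostname command on the default port 7077. The function name local_master_url is made up for illustration.

```shell
# Hypothetical helper (not part of Spark): build the spark:// URL
# for a Master running on this machine.
# Assumption: the Master registered under the name `hostname` prints
# and listens on port 7077. Pass a host and port to override.
local_master_url() {
  host="${1:-$(hostname)}"
  port="${2:-7077}"
  printf 'spark://%s:%s\n' "$host" "$port"
}

# Usage, from the Spark installation directory:
#   bin/spark-class org.apache.spark.deploy.worker.Worker "$(local_master_url)"
```

If the Worker still cannot connect, compare the printed URL with the one shown at the top of http://localhost:8080 and use that one instead.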
3. Connect REPL to Spark Cluster (KISS Principle)
todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ ./bin/spark-shell --master spark://todd-mcgraths-macbook-pro.local:7077
If all goes well, you should see something similar to the following:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/06 12:44:32 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2014-12-06 12:44:33.306 java[22811:1607] Unable to load realm info from SCDynamicStore
14/12/06 12:44:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.
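To check that the REPL is really using the cluster, one option is to run a tiny job through sc. The sketch below pipes a single Scala expression into spark-shell, assuming spark-shell reads statements from standard input and exits when the input ends. The function name spark_smoke_test and the SPARK_SHELL variable are made up for illustration.

```shell
# Hypothetical helper (not part of Spark): run a one-line job against
# the cluster at the given Master URL.
# Assumption: spark-shell reads statements from stdin and exits when
# stdin ends. SPARK_SHELL is an illustrative override for the path.
spark_smoke_test() {
  printf '%s\n' 'sc.parallelize(1 to 1000).count()' |
    "${SPARK_SHELL:-./bin/spark-shell}" --master "$1"
}

# Usage, from the Spark installation directory:
#   spark_smoke_test spark://todd-mcgraths-macbook-pro.local:7077
```

The job counts the numbers 1 to 1000, so the REPL should report a count of 1000. While the job runs, the application should also appear under Running Applications at http://localhost:8080.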
And there you are, ready to proceed. For a more detailed look at standalone configuration options and scripts, see https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
This example of running a Spark cluster locally is meant to ensure we’re ready to take on more difficult topics, such as using cluster managers like YARN and Mesos. We’ll also cover configuring a Spark cluster at Amazon.
But before we move on to more advanced Spark cluster setups, we’ll cover deploying a driver program to a Spark cluster and running it.
* This post will use a Mac, so translate to your OS accordingly.