How To: Apache Spark Cluster on Amazon EC2 Tutorial

How do you set up and run an Apache Spark cluster on EC2?  This post will walk you through each step to get an Apache Spark cluster up and running on EC2. The cluster consists of one master node and one worker node. It includes each step I took, regardless of whether it failed or succeeded.  While your experience may not match mine exactly, I hope these steps are helpful as you attempt to run an Apache Spark cluster on Amazon EC2.  There are screencasts throughout the steps.

Overview

The basis for this tutorial is the EC2 scripts provided with Spark.  It wouldn’t hurt to spend a few minutes reading http://spark.apache.org/docs/latest/ec2-scripts.html to get an idea of what this Apache Spark Cluster on EC2 tutorial will cover.

Assumptions

This post assumes you have already signed up for and verified an AWS account.  If not, sign up here: https://aws.amazon.com/

Approach

I’m going to go through the process step by step and also show some screenshots and screencasts along the way.  For example, there is a screencast that covers steps 1 through 5 below.

Spark Cluster on Amazon EC2 Step by Step

Note: There’s a screencast of steps one through four at the end of step five below.

1) Generate Key/Pair in EC2 section of AWS Console

Click “Key Pairs” in the left nav and then the Create Key Pair button.

Download the resulting key pair PEM file.
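If you already have the AWS CLI installed and configured, you can create the key pair from the command line instead of the console.  A rough equivalent (assuming the key pair name courseexample) is:

aws ec2 create-key-pair --key-name courseexample --query 'KeyMaterial' --output text > courseexample.pem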

2) Create a new AWS user named courseuser and download the credentials file, which includes the User Name, Access Key Id, and Secret Access Key.  We need the Access Key Id and Secret Access Key.
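Again, if you prefer the AWS CLI over the console, a rough equivalent (assuming the user name courseuser) is:

aws iam create-user --user-name courseuser

aws iam create-access-key --user-name courseuser

The create-access-key call prints the Access Key Id and Secret Access Key you’ll need in the next step.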

3) Set your environment variables with the Access Key Id and Secret Access Key from the previous step.  For me, that meant running the following from the command line:

export AWS_SECRET_ACCESS_KEY=F9mKN6obfusicatedpBrEVvel3PEaRiC

export AWS_ACCESS_KEY_ID=AKIAobfusicatedPOQ7XDXYTA
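To double-check that both variables are set in your current shell, you can run the following (note this will echo the secret to your terminal):

env | grep AWS_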

4) Open a terminal window and go to the root directory of your Spark distribution.  Then, copy the PEM file from the first step of this tutorial to the root of the Spark home directory.
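For example, if the PEM file ended up in your Downloads folder (adjust the path for your machine), the copy might look like:

cp ~/Downloads/courseexample.pem .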

5) From the Spark home directory, run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example

I received errors about the PEM file permissions, so I changed them as the error message recommended and re-ran the spark-ec2 script.
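The recommended fix is to make the key file readable only by you; assuming the file name from step 1, that looks like:

chmod 400 courseexample.pem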

Then, you may receive permission errors from Amazon.  If so, update the permissions of the courseuser user in the AWS console (for example, by attaching a managed policy such as AmazonEC2FullAccess) and try again.

You may receive an error about zone availability such as:

Your requested instance type (m1.large) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1e, us-east-1d, us-east-1a.

If so, just update the script’s --zone argument and re-run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem --zone=us-east-1d launch spark-cluster-example

The cluster creation takes approximately 10 minutes, with all kinds of output, including deprecation warnings and possibly errors starting GANGLIA.  GANGLIA errors are fine if you are just experimenting.  If you want to address them, try a different Spark version or tweak the PHP settings on your cluster.
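Once the launch finishes, you can also SSH into the master node using the script’s login action with the same key pair and cluster name:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem login spark-cluster-example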

Here’s a screencast example of me creating an Apache Spark cluster on EC2:

6) After the cluster creation succeeds, you can verify it by browsing to the master web UI at http://<your-ec2-hostname>.amazonaws.com:8080/

7) You can also verify from the Spark shell, in either Scala or Python.

Scala example:

bin/spark-shell --master spark://ec2-54-145-64-173.compute-1.amazonaws.com:7077

Python example:

IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077

At first, both of these will likely fail, with errors that eventually lead to an “ERROR OneForOneStrategy: java.lang.NullPointerException”.

8) This is an Amazon permissions issue: port 7077 is not open.  You need to open up port 7077 via an Inbound Rule on the master node’s security group.  Here’s a screencast on how to create an Inbound Rule in EC2:
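If you’d rather script the rule than click through the console, the AWS CLI can add it as well.  This sketch assumes the master security group follows spark-ec2’s <cluster-name>-master naming convention; replace <your-ip> with your own public IP:

aws ec2 authorize-security-group-ingress --group-name spark-cluster-example-master --protocol tcp --port 7077 --cidr <your-ip>/32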

After creating this inbound rule, both the IPython notebook and the Spark shell should connect and work.
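As a quick sanity check that the cluster is actually accepting work, you can run one of the bundled examples against the master (assuming your distribution ships the examples, as Spark 1.x releases do):

MASTER=spark://ec2-54-145-64-173.compute-1.amazonaws.com:7077 bin/run-example SparkPi 10

If it prints an approximation of Pi, jobs are being scheduled on the cluster.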

Conclusion

Hope this helps you configure a Spark cluster on EC2.  Let me know in the page comments if I can help.  Once you are finished with your EC2 instances, make sure to destroy the cluster using the following command:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem destroy spark-cluster-example
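If you want to keep the cluster around but pause the billing for compute, the same script also supports stop and start actions (stopped instances may still incur storage charges):

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem stop spark-cluster-example

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem start spark-cluster-example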

Featured Image Credit: https://flic.kr/p/g19ivQ

12 thoughts on “How To: Apache Spark Cluster on Amazon EC2 Tutorial”

  1. Johnny Chiu

    Hi Supergloo,

    Firstly, I would like to thank you for the detailed tutorial. I am new to Spark and would like to move into it. I have a question from following your tutorial and couldn’t find the way out after googling it. If possible, could you help me point out where I might have gone wrong?

    In step 7, I can verify from the Spark console in Scala. However, when I try
    >> IPYTHON_OPTS=”notebook” ./bin/pyspark –master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077

    I got the following error:
    Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:242)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:241)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:117)
    at org.apache.spark.launcher.Main.main(Main.java:86)

    Could you help me out? Thanks!!

    • admin Post Author

      Glad this tutorial is helping you so far. What operating system are you on: Windows, Mac, or Linux? It looks like Mac or Linux. What happens when you run it without the IPYTHON_OPTS, such as “./bin/pyspark –master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077”?

    • admin Post Author

      Also, I’m wondering about the encoding of the hyphen characters…. wondering if the double hyphen is being converted into a dash. Did you copy and paste from this tutorial by chance? If so, try typing in the command instead of copying and pasting. Let us know how it goes.

      • Johnny Chiu

        Hi, I am using Mac, and it’s indeed a hyphen/dash issue. I successfully opened the IPython notebook after correcting it.
        Original: IPYTHON_OPTS=”notebook” ./bin/pyspark -–master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077
        Corrected: IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077
        (the second hyphen had been converted to a dash)

        I appreciate your reply and the help, thanks!

  2. Nourhan

    I am a CS student. I’m new to using Spark and EC2, and I’m kind of struggling with all these commands since I’m working on Windows.
    So, my question is: how can I run these commands on Windows?
    Like this one, for example:
    ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example
    Thanks.

    • admin Post Author

      Hi Nourhan, install Python and try running ec2/spark_ec2.py instead of ec2/spark-ec2. If you open ec2/spark-ec2 in a text editor, you’ll see it just calls the Python spark_ec2.py script. Let us know how it goes.
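      For example (assuming Python is installed and on your PATH, and substituting your own key pair name), that might look like:

      python ec2/spark_ec2.py --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example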

  3. Nourhan

    Thank you for your patience.
    I’ve changed the environment variables, and then, from spark\ec2, I tried running Spark-Ec2.py on its own, but nothing happened. Then I wrote this:
    Spark-Ec2.py –Key-Pair=Courseexample –Identity-File=Courseexample.Pem Launch Spark-Cluster-Example
    and changed the key pair name to the one I am using, and no action happens:
    no errors, no messages,
    nothing.

  4. Valli

    Hi – I am getting the same error as Johnny. I’ve tried correcting the hyphen/dash issue, but no success. Any clues where I am faltering?

    I am running on VMware with Ubuntu (includes Spark 1.6.1 / Java 1.7 / Scala 2.11.8 / Python 2.7) – no Hadoop installed on this.

    IPYTHON_OPTS=”notebook” /opt/spark-1.6.1/bin/pyspark -master spark://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:7077
    Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:242)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:241)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:117)
    at org.apache.spark.launcher.Main.main(Main.java:86)

    Regards

  5. Prithwis

    Thank you for this detailed tutorial. Everything works as described. I created my AWS EC2 cluster using the spark-ec2 script and also managed to connect my IPython notebook to the cluster at the AWS master node on port 7077. However, when I tried to run a simple wordcount program (which runs faultlessly in local standalone mode), it goes into an indefinite wait, saying “WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources”.

    I checked this out on Stack Overflow (where else?) and, as advised by some, changed the number of cores to 1 and the memory to 512MB, but this does not help.

    Then I came across another Stack Overflow thread [ http://stackoverflow.com/questions/25176197/with-spark-how-to-connect-master-or-solve-an-errorwarn-taskschedulerimpl-init ] which quite clearly states that this script sets up the cluster in standalone mode and hence will NEVER accept a remote submit… hence useless!

    Finally, I note that the latest version of the Spark documentation does not refer to this script at all, even though the script is present in the distribution.

    I would appreciate any advice in this regard.


You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">