Using ipython notebook with Apache Spark couldn’t be easier. This post will cover how to use ipython notebook (Jupyter) with Spark and why it is the best choice when using Python with Spark.
This post assumes you have downloaded and extracted Apache Spark and that you are running on a Mac or *nix. If you are on Windows, see http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/
ipython notebook with Apache Spark
I recommend using the Python 2.7 Anaconda Python distribution, which can be downloaded here: https://www.continuum.io/downloads. It contains more than 300 of the most popular Python packages for science, math, engineering, and data analysis. Also, future Python Spark tutorials and Python Spark examples will use this distribution.
After you have Anaconda installed, you should make sure that ipython notebook (Jupyter) is up to date. Run the following commands in the Terminal (Mac/Linux) or Command Prompt (Windows):
conda update conda
conda update ipython
Ref: http://ipython.org/install.html in the “I am getting started with Python” section
Launching ipython notebook with Apache Spark
1) In a terminal, go to the root of your Spark install and enter the following command:

IPYTHON_OPTS="notebook" ./bin/pyspark
A browser tab should launch, and you should see various output in your terminal window depending on your logging level.
What’s going on here with the IPYTHON_OPTS variable passed to pyspark? Well, you can look at the source of bin/pyspark in a text editor. This section determines which Python executable is used to launch the driver:
# Determine the Python executable to use for the driver:
if [[ -n "$IPYTHON_OPTS" || "$IPYTHON" == "1" ]]; then
  # If IPython options are specified, assume user wants to run IPython
  # (for backwards-compatibility)
  PYSPARK_DRIVER_PYTHON_OPTS="$PYSPARK_DRIVER_PYTHON_OPTS $IPYTHON_OPTS"
  PYSPARK_DRIVER_PYTHON="ipython"
elif [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
  PYSPARK_DRIVER_PYTHON="$PYSPARK_PYTHON"
fi
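As a side note, if you are on a newer Spark release this mechanism may not apply: the IPYTHON and IPYTHON_OPTS variables were removed in Spark 2.0 in favor of the PYSPARK_DRIVER_PYTHON variables. A rough equivalent launch on Spark 2.0+ (check the docs for your exact version) would be:

```shell
# Spark 2.0+ equivalent of IPYTHON_OPTS="notebook" ./bin/pyspark:
# point the driver at the jupyter executable and pass it the "notebook" argument.
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark
```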
Verify Spark with ipython notebook
At this point, you should be able to create a new notebook and execute some python using the provided SparkContext. For example:
Here’s a screencast of running ipython notebook with pyspark on my laptop.
In this screencast, pay special attention to the terminal window log statements. At the default log level of INFO, you should see no errors in the pyspark output. Also, when you start a new notebook, the terminal should show the SparkContext sc being made available for use, such as
INFO SparkContext: Running Spark version
Why use ipython notebook with Spark?
1) The same reasons you use ipython notebook without Spark, such as convenience and the ability to easily share and execute notebooks.
2) Code completion. As the screencast shows, a Python Spark developer can hit the tab key to see the available functions, also known as code completion options.
Hope this helps, let me know if you have any questions.