Apache Spark, Cassandra and Game of Thrones

Apache Spark with Cassandra is a powerful combination in data processing pipelines.  In this post, we will build a Scala application with the Spark Cassandra combo and query battle data from Game of Thrones.  Now, we’re not going to make any show predictions!   But, we will show the most aggressive kings as well as kings which were attacked the most.  So, you’ve got that goin for you.

Overview

Our primary focus is the technical highlights of Spark Cassandra integration with Scala.  To do so, we will load up Cassandra with Game of Thrones battle data and then query it from Spark using Scala.  We’ll use Spark from both a shell as well as deploying a Spark Driver program to a cluster.  We’ll have examples of Scala case class marshalling courtesy of the DataStax connector as well as using SparkSQL to produce DataFrames.   We’re also mix in some sbt configuration as well.

There are screencasts and relevant links at the bottom of this post in the “Resources” section.

The intended audience of this Spark Cassandra tutorial is someone with beginning to intermediate experience with Scala and Apache Spark.  If you would like to reach this level quickly and efficiently, please check out our On-Demand Apache Spark with Scala Training Course.

Pre-Requisites

  1. Apache Cassandra (see resources below)
  2. Downloaded Game of Thrones data (see resources below)
  3. Apache Spark

Outline

  1. Import data into Cassandra
  2. Write Scala code
  3. Test Spark Cassandra code in SBT shell
  4. Deploy Spark Cassandra to Spark Cluster with SBT and spark-submit

Spark Cassandra Example

Part 1: Prepare Cassandra

Let’s import the GOT battle data into Cassandra.  To keep things simple, I’m going to use a local running Cassandra instance.  I started Cassandra with bin/cassandra script on my Mac.  (use cassandra.bat on Windows, but you knew that already.).

Next, connect to Cassandra with cqlsh and create a keyspace to use.  This tutorial creates a “gameofthrones” keyspace:

From here, we create a table for the battle data.

Then import the battles data using Cassandra COPY shown below.  (see Resouces section below for where to download data).  BTW, I needed to run a Perl script to update the end-of-line encodings from Mac to Unix on the CSV file using perl -pi -e 's/\r/\n/g.  Your mileage may vary.

That wraps Part 1.  Let’s move on to Part 2 where we’ll write some Scala code.

Part 2: Spark Cassandra Scala Code

(Note: All of the following sample code if available from Github.  Link in Resources section below.)

To begin, let’s layout the skeleton structure of the project –

Next, we’re going to add some files for sbt and the sbt-assembly plugin.

First the build file for sbt

got-battles/build.sbt file:

and the 1 line got-battles/project/assembly.sbt file:

And now let’s create the Spark driver code in got-battles/src/main/scala/com/supergloo called SparkCassandra.scala

 

Part 3: Run Spark Cassandra Scala Code from SBT Console

Start up the sbt console via sbt.  Once ready, you can issue the run command with an argument for your Spark Master location; i.e. run local[5]

(Again, there’s a screencast at the end of this post which shows an example of running this command.  See Resources section below.)

Depending on your log level, you should see various SparkCassandra outputs from the SparkCassandra code.  These console outputs from Cassandra is the indicator of success.  Oh yeah.  Say it with me now.  Oh yeahhhhh

Running code in the sbt console is a convenient way to make and test changes rapidly.  As I developed this code, there was a terminal open in one window and an editor open in another window.  Whenever I made a Scala source code change and saved, I could simply re-run in the sbt console.

So now, let’s say we’ve reached the point of wanting to deploy this Spark program.  Let’s find out in the next section.

Part 4: Assemble Spark Cassandra Scala Code and Deploy to Spark Cluster

To build a jar and deploy to a Spark cluster, we need to make a small change to our build.sbt file.  As you may have noticed from the code above, there are comments in the file which indicate what needs to be changed.  We should uncomment this line:

and comment out this line:

then, we can run sbt assembly from command-line to produce a Spark deployable jar.  If you use the sample build.sbt file this will produce target/scala-2.10/spark-cassandra-example-assembly-1.0.jar

To deploy, use spark-submit with the appropriate arguments; i.e.

Conclusion

So, what do you think?  When you run the code, you can see the most aggressive kings and the kings which were attacked the most.  Without giving it away, I think one could argue whether Mance Rayder should be tied with Renly Baratheon on the most attacked list.  But, that’s not really the point of this tutorial.  As for the code and setup, do you have any questions, opinions or suggestions for next steps?   

Spark Cassandra Tutorial Screencast

In the following screencast, I run through the steps described in this tutorial.  Stay tuned because there is blooper footage at the end of the screencast.  Because I mean, why not bloopers.

 

Spark Cassandra Tutorial Resources

  1. All source code including the battles.csv file I scrubbed using the perl script described above at Apache Spark Cassandra Example code
  2. https://github.com/datastax/spark-cassandra-connector
  3. DataFrames with Cassandra Connector
  4. Game of Thrones Data: https://github.com/chrisalbon/war_of_the_five_kings_dataset

 

 

Featured image credit https://flic.kr/p/dvpku1

 

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">