Spark Performance Monitoring with Metrics, Graphite and Grafana

Categories: Spark; Comments: no comments

Spark is distributed with the Metrics Java library, which can greatly improve your ability to diagnose issues with your Spark jobs. In this post, we’ll cover how to configure Metrics to report to a Graphite backend and view the results with Grafana. Optional, 20-Second Background: If you already know about Metrics, Graphite and Grafana,

read more

Spark Broadcast and Accumulator Examples in Scala

Categories: Spark; Comments: no comments

Spark Broadcast and Accumulator Overview: So far, we’ve learned about distributing processing tasks across a Spark cluster. But let’s go a bit deeper into a couple of approaches you may need when designing distributed tasks. I’d like to start with a question. What do we do when we need each Spark worker task to coordinate certain

read more

IntelliJ Scala and Apache Spark – Well, Now You Know

Categories: Spark; Comments: no comments

IntelliJ Scala and Spark Setup Overview: In this post, we’re going to review one way to set up IntelliJ for Scala and Spark development. The IntelliJ Scala combination is the best free setup for Scala and Spark development. And I have nothing against ScalaIDE (Eclipse for Scala) or using editors such as Sublime. I switched from

read more

Spark Streaming Testing with Scala Example

Categories: Spark; Comments: no comments

Spark Streaming Testing: How do you create and automate tests of Spark Streaming applications? In this post, we’ll show an example of one way in Scala. This post is heavy on code examples and has the added bonus of using a code coverage plugin. Are the example tests in this tutorial unit tests? Or, are

read more

Learning Spark PDF

Categories: Spark; Comments: no comments

So, I’ve noticed “Learning Spark PDF” is a search term that comes up on this site. Can someone help me understand what people are looking for when using this phrase? Are readers looking for the Learning Spark: Lightning-Fast Big Data Analysis book from O’Reilly? Perhaps looking for the new Apache Spark with Scala Tutorial book? It’s

read more

Apache Spark, Cassandra and Game of Thrones

Categories: Spark; Comments: no comments

Apache Spark with Cassandra is a powerful combination in data processing pipelines.  In this post, we will build a Scala application with the Spark Cassandra combo and query battle data from Game of Thrones.  Now, we’re not going to make any show predictions!   But, we will show the most aggressive kings as well as

read more

Spark RDD – A Two Minute Guide for Beginners

Categories: Spark; Comments: no comments

What is Spark RDD? Spark RDD is short for Apache Spark Resilient Distributed Dataset, often shortened to simply RDD. RDDs are a foundational component of the Apache Spark large-scale data processing framework. A Spark RDD is an immutable, fault-tolerant, and possibly distributed collection of data elements. RDDs may be

read more

Apache Spark Machine Learning Example with Scala

Categories: Spark; Comments: 3 comments

In this Apache Spark Machine Learning example, Spark MLlib will be introduced and Scala source code reviewed. This post and accompanying screencast videos will demonstrate a custom Spark MLlib driver application. Then, the Spark MLlib Scala source code will be examined. There will be many topics shown and explained, but first, let’s describe a

read more

Apache Spark Advanced Cluster Deploy Troubleshooting

Categories: Spark; Comments: no comments

In this Apache Spark example tutorial, we’ll review a few options for when your Scala Spark code does not deploy as anticipated. For example, does your Spark driver program rely on a third-party jar only compatible with Scala 2.11, but your Spark cluster is based on Scala 2.10? Maybe your code relies on a newer version

read more

Spark Scala with 3rd Party JARs Deploy to a Cluster

Categories: Spark; Comments: no comments

Overview: In this Apache Spark cluster deploy tutorial, we’ll cover how to deploy Spark driver programs to a Spark cluster when the driver program uses third-party jars. In this case, we’re going to use code examples from previous Spark SQL and Spark Streaming tutorials. At the end of this tutorial, there is a screencast of

read more