Spark Performance Monitoring with Metrics, Graphite and Grafana

/ Categories: Spark Comments: no comments

Spark ships with the Metrics Java library, which can greatly improve your ability to diagnose issues with your Spark jobs. In this post, we’ll cover how to configure Metrics to report to a Graphite backend and view the results with Grafana.
Optional, 20 Second Background
If you already know about Metrics, Graphite and Grafana,

read more

Spark Broadcast and Accumulator Examples in Scala

/ Categories: Spark Comments: no comments

Spark Broadcast and Accumulator Overview
So far, we’ve learned about distributing processing tasks across a Spark cluster. But let’s go a bit deeper into a couple of approaches you may need when designing distributed tasks. I’d like to start with a question. What do we do when we need each Spark worker task to coordinate certain

read more

Apache Kafka – A Few Things You Should Know

/ Categories: Kafka Comments: no comments

What is Apache Kafka?
Simply put, Kafka is a messaging system that makes it possible to move data between systems. For the layperson, however, the concept may not be easy to understand. Technically, it is defined as a commit log service that is distributed, partitioned, and replicated. It provides a messaging system that

read more

IntelliJ Scala and Apache Spark – Well, Now You Know

/ Categories: Spark Comments: no comments

IntelliJ Scala and Spark Setup Overview
In this post, we’re going to review one way to set up IntelliJ for Scala and Spark development. In my view, the IntelliJ Scala combination is the best free setup for Scala and Spark development. I have nothing against ScalaIDE (Eclipse for Scala) or editors such as Sublime. I switched from

read more

Spark Streaming Testing with Scala Example

/ Categories: Spark Comments: no comments

Spark Streaming Testing
How do you create and automate tests of Spark Streaming applications? In this post, we’ll show an example of one way in Scala. This post is heavy on code examples and has the added bonus of using a code coverage plugin. Are the example tests in this tutorial unit tests? Or, are

read more

Apache Kafka and Amazon Kinesis – How do they compare?

/ Categories: Kafka Comments: no comments

Apache Kafka vs. Amazon Kinesis
Like many of the offerings from Amazon Web Services, Amazon Kinesis is modeled after an existing open source system. In this case, Kinesis is modeled after Apache Kafka. Kinesis is known to be incredibly fast, reliable and easy to operate. Similar to Kafka, there are plenty of language-specific clients

read more

Learning Spark PDF

/ Categories: Spark Comments: no comments

So, I’ve noticed “Learning Spark PDF” is a search term that shows up on this site. Can someone help me understand what people are looking for when using this phrase? Are readers looking for the Learning Spark: Lightning-Fast Big Data Analysis book from O’Reilly? Or perhaps for the new Apache Spark with Scala Tutorial book? It’s

read more

Apache Spark, Cassandra and Game of Thrones

/ Categories: Spark Comments: no comments

Apache Spark with Cassandra is a powerful combination in data processing pipelines. In this post, we will build a Scala application with the Spark Cassandra combo and query battle data from Game of Thrones. Now, we’re not going to make any show predictions! But we will show the most aggressive kings as well as

read more

Spark RDD – A Two Minute Guide for Beginners

/ Categories: Spark Comments: no comments

What is Spark RDD?
Spark RDD is short for Apache Spark Resilient Distributed Dataset. A Spark Resilient Distributed Dataset is often shortened to simply RDD. RDDs are a foundational component of the Apache Spark large-scale data processing framework. A Spark RDD is an immutable, fault-tolerant, and possibly distributed collection of data elements. RDDs may be

read more

Apache Spark Machine Learning Example with Scala

/ Categories: Spark Comments: 3 Comments

In this Apache Spark Machine Learning example, we introduce Spark MLlib and review Scala source code. This post and the accompanying screencast videos demonstrate a custom Spark driver application that uses MLlib. Then, the Spark MLlib Scala source code will be examined. There will be many topics shown and explained, but first, let’s describe a

read more