Spark Machine Learning – Chapter 11 Machine Learning with MLlib

Spark Machine Learning
/ Categories: Spark Comments: no comments

Spark machine learning is contained within Spark MLlib. MLlib is Spark’s library of machine learning (ML) functions designed to run in parallel on clusters. MLlib contains a variety of learning algorithms. The topic of machine learning itself could fill many books, so instead, this chapter explains ML in Apache Spark. This post is an excerpt

read more

Spark Streaming from Learning Spark Chapter 10

Spark Streaming
/ Categories: Spark, Summary Series Comments: no comments

Spark Streaming: applications built on Spark Streaming can track statistics about page views in real time, train a machine learning model, or automatically detect anomalies. The abstraction in Spark Streaming is called a DStream, or discretized stream. A DStream is a sequence of data that arrives over time. Internally, each DStream is represented as a sequence of

read more

How To: Apache Spark Cluster on Amazon EC2 Tutorial

Spark Cluster on EC2
/ Categories: Spark Comments: 12 Comments

How do you set up and run an Apache Spark cluster on EC2? This post will walk you through each step to get an Apache Spark cluster up and running on EC2. The cluster consists of one master and one worker node. It includes each step I took, regardless of whether it failed or succeeded. While your experience may

read more

Learning Spark: Lightning-Fast Big Data Analysis – Developer Deconstructed – Chapter 7 – Running on a Cluster

Spark cluster managers
/ Categories: Spark Comments: no comments

This is the seventh post in the Learning Spark book summary series[1]. Chapter 7: Running on a Cluster. A feature of Spark is the ability to run computations in parallel by using many machines running in cluster mode. Even better, writing parallelized applications uses the same API as the previously shown examples. Spark can run on

read more

Learning Spark: Lightning-Fast Big Data Analysis – Developer Deconstructed – Chapter 6

Advanced Spark Programming
/ Categories: Spark, Summary Series Comments: no comments

This is the sixth post in the Learning Spark book summary series[1]. Chapter 6: Advanced Spark Programming. Overview: two types of shared variables are introduced: accumulators, to aggregate information, and broadcast variables, to efficiently distribute large values. The examples use ham radio operators’ call logs as the input. Dividing work on a per-partition basis allows us

read more