Spark Machine Learning – Chapter 11 Machine Learning with MLlib

Spark Machine Learning is contained with Spark MLlib.  Spark MLlib Spark’s library of machine learning (ML) functions designed to run in parallel on clusters.  MLlib contains a variety of learning algorithms. The topic of machine learning itself could fill many books, so instead, this chapter explains ML in Apache Spark.

This post is an excerpt from Chapter 11 Spark Machine Learning in our Apache Spark book Learning Spark Summary


MLlib invokes various algorithms on RDDs. As an example, MLlib could be used to identify spam through the following:

1) Create an RDD of strings representing email.

2) Run one of MLlib’s feature extraction algorithms to convert text into an RDD of vectors.

3) Call a classification algorithm on the RDD of vectors to return a model object to classify new points.

4) Evaluate the model on a test dataset using one of MLlib’s evaluation functions.

Some classic ML algorithms are not included with Spark MLib because they were not designed for parallel computations. MLlib contains several recent research algorithms for clusters, such as distributed random forests, K-means | |, and alternating least squares. MLlib is best suited for running machine learning algorithms on large datasets.

Machine Learning Basics

Machine learning algorithms try to predict or make decisions based on training data.  There are multiple types of learning problems, including classification, regression, or clustering.  All of which have different objectives.

All learning algorithms require defining a set of features for each item.  Then this set of features is sent into the learning function. For example, for an email, a set of features might include the server it comes from, the number of mentions of the word free, or the color of the text.

Pipelines often train multiple versions of a model and evaluate each one. To do this, separate the input data into “training” and “test” sets.

A very simple program for building a spam classifier in python is shown.  The code and data files are available in the book’s Git repository.

Data Types

MLlib contains a few specific data types including Vector, LabeledPoint, Rating, and various Model classes

Working with Vectors

Vectors come in two flavors: dense and sparse. Dense vectors store all their entries in an array of floating-point numbers.  Sparse vectors are usually preferable because they store nonzero values and their indices.


The key algorithms available in MLlib are Feature Extraction, TF-IDF, Scaling, Normalization, Word2Vec


Various statistics functions are available.

Classification and Regression

Classification and regression are two common forms of supervised learning where the difference between the two is the type of variable predicted.

Linear regression is one of the most common methods for regression.  This method predicts the output variable as a linear combination of the features.

Logistic regression is a binary classification method which identifies a linear separating plane between positive and negative examples such as Support Vector Machines (SVM).

Naive Bayes is a multiclass classification algorithm scoring how well each point belongs in each class based on a linear function of the features.

Decision trees are a flexible model used for both classification and regression.


Clustering is an unsupervised learning task involving the grouping of objects into clusters of high similarity.

MLlib includes the popular K-means algorithm for clustering, as well as a variant called K-means | |.

Collaborative Filtering and Recommendation

Collaborative filtering is a technique for recommender systems.  Users’ ratings and interactions with various products are used to make recommendations.

MLlib includes Alternating Least Squares (ALS) which is a popular algorithm for collaborative filtering.

Dimensionality Reduction

The main technique for dimensionality reduction used by the machine learning community is principal component analysis (PCA), but MLlib also provides the lower-level singular value decomposition (SVD) primitive.

Model Evaluation

Many learning tasks may be addressed with different models.

Tips and Performance Considerations

The following are suggested to consider: feature preparation, algorithm configuration, RDD caching, recognizing sparsity and level of parallelism.

Pipeline API

Based on the concept of pipelines, starting in Spark 1.2, MLlib is adding a new, higher-level API for machine learning. The pipeline API is similar to the one found in SciKit-Learn.


This post is an excerpt from Chapter 11 Spark Machine Learning in our Apache Spark book Learning Spark Summary

Featured image photo credit

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">