Spark Tutorials With Scala

Spark provides developers and engineers with a Scala API. The Spark tutorials with Scala listed below cover the Scala Spark API within Spark Core, clustering, Spark SQL, streaming, machine learning (MLlib), and more.

You may access the tutorials in any order you choose.

The tutorials assume a general understanding of Spark and the Spark ecosystem, regardless of programming language. If you are new to Apache Spark, the recommended path is to start at the top and work your way down to the bottom.

New Spark tutorials are added regularly, so check back often or sign up for our notification list.

Apache Spark Essentials

Overview

It is essential that you are comfortable with the Spark concepts of Resilient Distributed Datasets (RDDs), transformations, and actions. If you need more information on these subjects from a non-Scala point of view, start at the Spark Tutorial page first and then return here. The following tutorials cover the Spark fundamentals from a Scala perspective.
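
To make those three concepts concrete, here is a minimal word-count sketch; the object name, sample strings, and local master setting are illustrative only:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    // Local master for illustration; on a cluster this is set via spark-submit
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // RDD: an immutable, partitioned collection distributed across the cluster
    val lines = sc.parallelize(Seq("spark makes big data simple", "scala makes spark concise"))

    // Transformations (lazy): nothing executes until an action is called
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: collect() triggers evaluation of the lineage built above
    wordCounts.collect().foreach(println)

    sc.stop()
  }
}
```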

Tutorials

With these three fundamental concepts in mind, you are in a position to move on to any of the following sections on clustering, SQL, streaming, and/or machine learning (MLlib) organized below.

Spark Clusters

Spark applications run as independent sets of parallel processes distributed across numerous nodes. A group of nodes working together is commonly known as a cluster. Spark processes are coordinated by a SparkContext object. The SparkContext can connect to several types of cluster managers, including Mesos, YARN, and Spark’s own internal cluster manager, called “Standalone”. Once connected to a cluster manager, Spark acquires executors on nodes within the cluster.
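
As a sketch of what this looks like in code, the master URL passed to SparkConf selects the cluster manager; the host and port below are placeholders, not a real cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClusterConnect {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cluster-connect")
      // Standalone master URL (placeholder host); for Mesos or YARN,
      // a "mesos://host:5050" or "yarn-client" master is used instead
      .setMaster("spark://master-host:7077")

    val sc = new SparkContext(conf)

    // At this point Spark has acquired executors on worker nodes;
    // run a trivial job to confirm the cluster is doing the work
    println(sc.parallelize(1 to 100).sum())

    sc.stop()
  }
}
```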

Tutorials

The following Spark clustering tutorials will teach you about Spark cluster capabilities with Scala source code examples.

For more information on Spark Clusters, such as running and deploying on Amazon’s EC2, make sure to check the Integrations section at the bottom of this page.

Spark SQL with Scala

Spark SQL is the Spark component for structured data processing. Spark SQL interfaces provide Spark with insight into both the structure of the data and the computation being performed. There are multiple ways to interact with Spark SQL, including SQL, the DataFrames API, and the Datasets API. Developers may choose between these approaches.

SQL

Spark SQL queries may be written using either a basic SQL syntax or HiveQL.  Spark SQL can also be used to read data from existing Hive installations. When running SQL from within a programming language such as Python or Scala, the results will be returned as a DataFrame. You can also interact with the SQL interface using JDBC/ODBC.
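
For example, here is a minimal sketch using the Spark 1.6-era SQLContext; the table name and rows are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlQuerySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-query").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Register some in-memory data as a temporary table so SQL can query it
    val people = Seq(("Alice", 34), ("Bob", 19)).toDF("name", "age")
    people.registerTempTable("people")

    // The query result comes back as a DataFrame
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
    adults.show()

    sc.stop()
  }
}
```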

DataFrames

A DataFrame is a distributed collection of data organized into named columns. A DataFrame is conceptually equivalent to a table in a relational database, but with richer optimizations. DataFrames can be created from sources such as CSV files, JSON, Hive tables, external databases, or existing RDDs.
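
The following sketch shows two of these creation paths; the JSON path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSources {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("df-sources").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // From an existing RDD (here, a parallelized local collection)
    val fromRdd = sc.parallelize(Seq(("widget", 2), ("gadget", 5))).toDF("item", "qty")

    // From a JSON file (placeholder path)
    val fromJson = sqlContext.read.json("path/to/people.json")

    // Named columns enable relational-style operations
    fromRdd.filter(fromRdd("qty") > 2).show()
    fromJson.printSchema()

    sc.stop()
  }
}
```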

Datasets

A Dataset is a new experimental interface added in Spark 1.6. Datasets aim to combine the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine.
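
A minimal sketch, assuming Spark 1.6 or later; the Person case class and sample rows are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class defines the typed schema for the Dataset
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ds-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // toDS() builds a Dataset; the encoder for Person comes from the implicits
    val people = Seq(Person("Alice", 34), Person("Bob", 19)).toDS()

    // Typed, compile-checked operations that still run on Spark SQL's engine
    people.filter(_.age >= 21).show()

    sc.stop()
  }
}
```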

Spark SQL with Scala Tutorials

Readers may also be interested in the Spark with Cassandra tutorials located in the Integrations section below. Spark with Cassandra covers aspects of Spark SQL as well.

Spark Streaming with Scala

Spark Streaming is the Spark module that enables stream processing of live data streams. Data can be ingested from many sources, such as Kinesis, Kafka, Twitter, or TCP sockets (including WebSockets). The stream data may be processed with high-level functions such as map, join, or reduce. Processed data can then be pushed out of the pipeline to filesystems, databases, and dashboards.

Spark’s MLlib algorithms may be used on data streams, as shown in the tutorials below.

Spark Streaming receives live input data streams and divides the data into batches of a configurable interval.

Spark Streaming provides a high-level abstraction called a discretized stream, or “DStream” for short. DStreams can be created either from input data streams or by applying operations to other DStreams. Internally, a DStream is represented as a sequence of RDDs.
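
Here is a minimal DStream sketch against a TCP socket source; the host, port, and 5-second batch interval are arbitrary choices for illustration (feed it locally with `nc -lk 9999`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")

    // The batch interval is configurable; each interval becomes one RDD
    val ssc = new StreamingContext(conf, Seconds(5))

    // DStream created from a TCP socket input source
    val lines = ssc.socketTextStream("localhost", 9999)

    // High-level operations on the stream, batch by batch
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```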

Spark Streaming with Scala Tutorials

Spark Machine Learning

MLlib is Spark’s machine learning (ML) library component. The goal of MLlib is to make machine learning easier and more widely available. It consists of popular learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.

Spark’s MLlib is divided into two packages:

  1. spark.mllib, which contains the original API built on top of RDDs
  2. spark.ml, which is built on top of DataFrames and is used for constructing ML pipelines

spark.ml is the recommended approach because the DataFrame API is more versatile and flexible.
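
As a rough sketch of what a spark.ml pipeline looks like, the stages below tokenize text, hash the tokens into feature vectors, and fit a logistic regression; the training rows are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ml-pipeline").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Tiny invented training set: (id, text, label)
    val training = Seq(
      (0L, "spark rdd dataframe dataset", 1.0),
      (1L, "cooking with garlic and onions", 0.0)
    ).toDF("id", "text", "label")

    // Pipeline stages: tokenize -> term frequencies -> logistic regression
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
    model.transform(training).select("id", "prediction").show()

    sc.stop()
  }
}
```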

Spark MLlib with Scala Tutorials

Spark Performance Monitoring and Debugging

Spark with Scala Integration Tutorials

The following Scala Spark tutorials build upon the previously covered topics and move into more specific use cases.

Featured Image adapted from https://flic.kr/p/7zAZx7