Spark Tutorials With Scala
Spark provides developers and engineers with a Scala API. The Spark tutorials with Scala listed below cover the Scala Spark API within Spark Core, Clustering, Spark SQL, Streaming, Machine Learning (MLlib), and more.
You may access the tutorials in any order you choose.
The tutorials assume a general understanding of Spark and the Spark ecosystem, independent of any particular programming language. If you are new to Apache Spark, the recommended path is to start at the top and work your way down to the bottom.
New Spark tutorials are added regularly, so check back often or sign up for our notification list.
- 1 Apache Spark Essentials
- 2 Spark Clusters
- 3 Spark SQL with Scala
- 4 Spark Streaming with Scala
- 5 Spark Machine Learning
- 6 Spark Performance Monitoring and Debugging
- 7 Spark with Scala Integration Tutorials
- 8 Spark Operations
Apache Spark Essentials
It is essential that you are comfortable with the Spark concepts of Resilient Distributed Datasets (RDDs), Transformations, and Actions. If you need more information on these subjects from a non-Scala point of view, start at the Spark Tutorial page first and then return to this page. In the following tutorials, the Spark fundamentals are covered from a Scala perspective.
With these three fundamental concepts in mind, you are in a position to move on to any of the following sections on clustering, SQL, Streaming, and machine learning (MLlib).
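As a quick refresher on how these three concepts fit together, here is a minimal sketch rather than a definitive implementation. It assumes Spark is on the classpath and runs in local mode; the object name `RddBasicsSketch` and the helper `wordCounts` are hypothetical names chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasicsSketch {
  // Counts words using lazy transformations followed by a single action.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    // Create an RDD from a local collection.
    val linesRdd = sc.parallelize(lines)

    // Transformations only describe the computation; nothing runs yet.
    val counts = linesRdd
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // collect() is an action: it triggers execution and returns results to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // "local[*]" is an assumption for trying this out on one machine.
    val conf = new SparkConf().setMaster("local[*]").setAppName("rdd-basics-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try wordCounts(sc, Seq("spark with scala", "scala with spark core")).foreach(println)
    finally sc.stop()
  }
}
```

The transformations (`flatMap`, `filter`, `map`, `reduceByKey`) are lazy, so the work only happens when the `collect()` action is called.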
Spark Clusters
Spark applications run as independent sets of parallel processes distributed across multiple nodes. A group of nodes working together is commonly known as a cluster. Spark processes are coordinated by a SparkContext object. The SparkContext can connect to several types of cluster managers, including Mesos, YARN, or Spark’s own built-in cluster manager, called “Standalone”. Once connected to the cluster manager, Spark acquires executors on nodes within the cluster.
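To make the role of the SparkContext more concrete, the following sketch shows how the choice of cluster manager is typically expressed through the master URL. It is an illustration under assumptions: the object name `ClusterConnectSketch` and the helper `buildConf` are hypothetical, and the host names in the comments are placeholders, not real endpoints.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClusterConnectSketch {
  // Builds a configuration that points the application at a given master URL.
  def buildConf(appName: String, master: String): SparkConf =
    new SparkConf().setAppName(appName).setMaster(master)

  def main(args: Array[String]): Unit = {
    // Common master URL forms (hosts and ports below are placeholders):
    //   local[*]            run in-process using all local cores
    //   spark://host:7077   Spark Standalone cluster manager
    //   yarn                YARN (cluster location comes from the Hadoop configuration)
    //   mesos://host:5050   Mesos
    val master = args.headOption.getOrElse("local[*]")

    val sc = new SparkContext(buildConf("cluster-connect-sketch", master))
    try {
      // A trivial job; the tasks run on whichever executors Spark acquired.
      val total = sc.parallelize(1 to 100).map(_ * 2).reduce(_ + _)
      println(s"master=${sc.master} total=$total")
    } finally sc.stop()
  }
}
```

In practice the master is often supplied at submit time instead of being set in code, which keeps the same program deployable to different cluster managers.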
The following Spark clustering tutorials will teach you about Spark cluster capabilities with Scala source code examples.
- Cluster Part 1 Run Standalone
- Cluster Part 2 Deploy a Scala program to the Cluster
- Spark Cluster Deploy Troubleshooting
- Accumulators and Broadcast variables
For more information on Spark Clusters, such as running and deploying on Amazon’s EC2, make sure to check the Integrations section at the bottom of this page.
Spark SQL with Scala
Spark SQL is the Spark component for structured data processing. Spark SQL interfaces provide Spark with an insight into both the structure of the data as well as the processes being performed. There are multiple ways to interact with Spark SQL including SQL, the DataFrames API, and the Datasets API. Developers may choose between the various Spark API approaches.
Spark SQL queries may be written using either a basic SQL syntax or HiveQL. Spark SQL can also be used to read data from existing Hive installations. When running SQL from within a programming language such as Python or Scala, the results will be returned as a DataFrame. You can also interact with the SQL interface using JDBC/ODBC.
A DataFrame is a distributed collection of data organized into named columns. DataFrames can be considered conceptually equivalent to a table in a relational database, but with richer optimizations. DataFrames can be created from sources such as CSVs, JSON, tables in Hive, external databases, or existing RDDs.
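The two styles of interaction described above, SQL and the DataFrame API, can be sketched side by side. This is a minimal example that assumes Spark 2.0 or later (for `SparkSession` and `createOrReplaceTempView`); the object name `SparkSqlSketch`, the sample rows, and the view name `people` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkSqlSketch {
  // Builds a small DataFrame with named columns from a local collection.
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Ann", 34), ("Bob", 17), ("Cy", 52)).toDF("name", "age")
  }

  // Style 1: register a temporary view and query it with SQL.
  // The result of spark.sql is itself a DataFrame.
  def adultsViaSql(spark: SparkSession): DataFrame = {
    people(spark).createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 18 ORDER BY name")
  }

  // Style 2: express the same query with the DataFrame API.
  def adultsViaApi(spark: SparkSession): DataFrame = {
    import spark.implicits._
    people(spark).filter($"age" >= 18).select("name").orderBy("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("spark-sql-sketch").getOrCreate()
    try {
      adultsViaSql(spark).show()
      adultsViaApi(spark).show()
    } finally spark.stop()
  }
}
```

Both functions describe the same query, so which one to use is largely a matter of preference and of how much of the logic is already written in SQL.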
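The Dataset interface was added in Spark 1.6 as an experimental API and became a stable part of Spark SQL in Spark 2.0, where a DataFrame is a Dataset of Row objects. Datasets aim to combine the benefits of RDDs (strong typing and the ability to use lambda functions) with the benefits of Spark SQL’s optimized execution engine.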
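A short sketch of the typed Dataset API follows. It assumes Spark 2.0 or later; the case class `Person` and the object name `DatasetSketch` are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object DatasetSketch {
  // The case class gives the Dataset a compile-time schema.
  final case class Person(name: String, age: Int)

  // Filters and maps with ordinary Scala lambdas over typed objects.
  def adultNames(spark: SparkSession, people: Seq[Person]): Dataset[String] = {
    import spark.implicits._ // brings the encoders for Person and String into scope
    people.toDS()
      .filter(_.age >= 18)
      .map(_.name)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
    try adultNames(spark, Seq(Person("Ann", 34), Person("Bob", 17))).show()
    finally spark.stop()
  }
}
```

Because the lambdas operate on `Person` values, a misspelled field name is caught by the compiler instead of failing at runtime.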
Spark SQL with Scala Tutorials
Readers may also be interested in pursuing tutorials such as Spark with Cassandra tutorials located in the Integration section below. Spark with Cassandra covers aspects of Spark SQL as well.
Spark Streaming with Scala
Spark Streaming is the Spark module which enables stream processing of live data streams. Data can be ingested from many sources such as Kinesis, Kafka, Twitter, or TCP sockets, including WebSockets. The stream data may be processed with high-level functions such as reduce. Processed data can then be pushed out of the pipeline to filesystems, databases, and dashboards.
Spark’s MLlib algorithms may be used on data streams, as shown in the tutorials below.
Spark Streaming receives live input data streams and divides the data into configurable batches.
Spark Streaming provides a high-level abstraction called discretized stream or “DStream” for short. DStreams can be created either from input data streams or by applying operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
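The idea that a DStream is a sequence of RDDs can be illustrated with `queueStream`, which turns a queue of RDDs into an input stream. The sketch below is for local experimentation only and is not a production pattern; the object name `DStreamSketch`, the helper `countWordsPerBatch`, and the timing parameters are assumptions made for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import scala.collection.mutable

object DStreamSketch {
  // Feeds each element of `batches` into the stream as one RDD and
  // returns the per-batch word counts observed while the stream ran.
  def countWordsPerBatch(
      sc: SparkContext,
      batches: Seq[Seq[String]],
      batchMillis: Long,
      runMillis: Long): List[(String, Int)] = {
    val ssc = new StreamingContext(sc, Milliseconds(batchMillis))

    // Each queued RDD becomes one batch of the input DStream.
    val queue = mutable.Queue[RDD[String]]()
    batches.foreach(batch => queue += sc.parallelize(batch))

    val seen = mutable.ArrayBuffer.empty[(String, Int)]
    ssc.queueStream(queue)
      .flatMap(_.split("\\s+"))   // a new DStream derived from the input DStream
      .map(word => (word, 1))
      .reduceByKey(_ + _)         // reduces within each batch
      .foreachRDD { rdd =>
        // Runs on the driver once per batch interval.
        val batchCounts = rdd.collect()
        seen.synchronized { seen ++= batchCounts }
        ()
      }

    ssc.start()
    ssc.awaitTerminationOrTimeout(runMillis)
    ssc.stop(stopSparkContext = false, stopGracefully = false)
    seen.synchronized { seen.toList }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-sketch")
    val sc = SparkContext.getOrCreate(conf)
    try countWordsPerBatch(sc, Seq(Seq("a b a"), Seq("b c")), 500L, 3000L).foreach(println)
    finally sc.stop()
  }
}
```

Note that `reduceByKey` here aggregates within a single batch, so the same word can appear once per batch in the output.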
Spark Streaming with Scala Tutorials
- Spark Streaming Overview
- Spark Streaming Example Streaming from Slack
- Spark Streaming with Kafka Tutorial
- Spark Streaming Testing
Spark Machine Learning
MLlib is Spark’s machine learning (ML) library component. The goal of MLlib is to make machine learning easier and more widely available. It consists of popular learning algorithms and utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
Spark’s MLlib is divided into two packages:
- spark.mllib which contains the original API built over RDDs
- spark.ml built over DataFrames used for constructing ML pipelines
spark.ml is the recommended approach because the DataFrame API is more versatile and flexible.
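To show what constructing an ML pipeline over DataFrames can look like, here is a minimal sketch using spark.ml. The training rows, column names, and parameter values are invented for illustration, and the object name `MlPipelineSketch` is hypothetical; it assumes Spark 2.0 or later with the MLlib module on the classpath.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object MlPipelineSketch {
  // Fits a two-stage pipeline on a tiny made-up data set and returns predictions.
  def fitAndPredict(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Toy training data: two numeric features and a binary label.
    val training = Seq(
      (0.0, 1.0, 0.0),
      (0.2, 0.9, 0.0),
      (1.0, 0.1, 1.0),
      (0.9, 0.0, 1.0)
    ).toDF("x1", "x2", "label")

    // Stage 1: combine the raw columns into a single feature vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")

    // Stage 2: a classifier that reads the "features" and "label" columns.
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val model = pipeline.fit(training)

    // transform() appends prediction columns to the input DataFrame.
    model.transform(training).select("x1", "x2", "label", "prediction")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ml-pipeline-sketch").getOrCreate()
    try fitAndPredict(spark).show()
    finally spark.stop()
  }
}
```

Each stage reads and writes DataFrame columns, which is what lets feature preparation and model fitting be chained into one pipeline.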
Spark MLlib with Scala Tutorials
Spark Performance Monitoring and Debugging
Spark with Scala Integration Tutorials
The following Scala Spark tutorials build upon the previously covered topics and apply them to more specific use cases:
- Spark Amazon S3 Tutorial
- Spark Deploy to an EC2 Cluster Tutorial
- Spark Cassandra from Scala Tutorial
- Spark Scala in IntelliJ
Spark Operations
The following Scala Spark tutorials are related to operational concepts.
Featured Image adapted from https://flic.kr/p/7zAZx7