Spark RDD – A Two Minute Guide for Beginners

What is Spark RDD?

RDD is short for Resilient Distributed Dataset.  RDDs are a foundational component of the Apache Spark large-scale data processing framework.

An RDD is an immutable, fault-tolerant, and possibly distributed collection of data elements.  RDDs may be operated on in parallel across a cluster of computer nodes.  To make this possible, each RDD is divided into logical partitions, which may be computed on different nodes of the cluster through Spark Transformation APIs.  RDDs may contain any type of Python, Java, or Scala object, including user-defined classes.
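To make the idea of partitions concrete, here is a minimal sketch in plain Python.  It is not Spark code: the helper names split_into_partitions and square_partition are made up for illustration, and threads stand in for cluster nodes.  In PySpark, a call such as sc.parallelize(data, 3), where sc is a SparkContext, asks Spark to do the splitting for you.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, num_partitions):
    """Split a list into roughly equal contiguous chunks (hypothetical helper)."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def square_partition(partition):
    """Work done independently on one partition, as a single node would do it."""
    return [x * x for x in partition]


data = list(range(1, 11))
partitions = split_into_partitions(data, 3)

# Threads stand in for cluster nodes; each one handles one partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(square_partition, partitions))

# Stitch the per-partition results back together in order.
squares = [x for part in results for x in part]
print(partitions)
print(squares)
```

Each partition is processed without reference to the others, which is what lets the work spread across machines.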

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value by performing a computation on the RDD.
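The difference between the two kinds of operations can be sketched with a toy class in plain Python.  ToyRDD is a hypothetical stand-in, not part of Spark; it only borrows the method names map, filter, collect, count and reduce from the real RDD API.  Transformations return a new ToyRDD and compute nothing, while actions run the computation and return a plain value, which mirrors Spark's lazy evaluation.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Hypothetical, single-machine model of an RDD (not Spark's class)."""

    def __init__(self, compute):
        # `compute` is a zero-argument function returning a fresh iterator.
        self._compute = compute

    @classmethod
    def from_list(cls, items):
        snapshot = tuple(items)  # immutable copy of the source data
        return cls(lambda: iter(snapshot))

    # Transformations: describe a new dataset, compute nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the computation and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return _reduce(fn, self._compute())


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6])
evens = numbers.filter(lambda x: x % 2 == 0)   # transformation
doubled = evens.map(lambda x: x * 2)           # transformation

doubled_values = doubled.collect()                  # action
doubled_count = doubled.count()                     # action
doubled_sum = doubled.reduce(lambda a, b: a + b)    # action
print(doubled_values, doubled_count, doubled_sum)
```

Note that numbers is never modified; every transformation produces a new dataset, which is what immutability means here.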

How are Spark RDDs created?

Spark RDDs are created in two main ways.  First, SparkContext functions create new RDDs from a variety of sources; e.g. the textFile function reads from a local filesystem, Amazon S3 or Hadoop’s HDFS.  Second, Spark Transformation functions create new RDDs from previously created RDDs.  For example, an RDD of only North America customers could be constructed from an RDD of all customers throughout the world.
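The customer example can be sketched in plain Python as follows.  This is an illustration rather than Spark code: the file name customers.txt and the "name,region" record format are assumptions made up for the example.  In PySpark the two steps would correspond roughly to sc.textFile(path) followed by a filter on the resulting RDD, where sc is a SparkContext.

```python
import os
import tempfile

# Hypothetical sample data: one "name,region" record per line.
sample = (
    "Ana,South America\n"
    "Bob,North America\n"
    "Chen,Asia\n"
    "Dana,North America\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "customers.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Like textFile: each line of the file becomes one element.
    with open(path, encoding="utf-8") as f:
        all_customers = [line.rstrip("\n") for line in f]

# Derive a new collection from the existing one; the original is left untouched.
north_america = [c for c in all_customers if c.split(",")[1] == "North America"]
print(north_america)
```

The worldwide collection still holds all four records afterwards; the North America collection is a separate, smaller dataset derived from it.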

In addition to loading text files from file systems, RDDs may be created from external storage systems such as JDBC databases (for example MySQL), HBase, Hive, Cassandra, or any data source compatible with Hadoop InputFormat.

RDDs are also created and manipulated when using Spark modules such as Spark Streaming and Spark MLlib.

Why Spark RDD?

Spark makes use of data abstraction through RDDs to achieve faster and more efficient performance than Hadoop’s MapReduce.

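RDDs support in-memory processing.  Accessing data from memory is commonly cited as 10 to 100 times faster than accessing data from a network or disk, though the actual gain depends on the workload and hardware.  Hadoop’s MapReduce-based processing, by contrast, often reads from and writes to disk between processing steps.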

In addition to performance gains, working through an abstraction layer provides a convenient and consistent way for developers and engineers to work with a variety of data sets.

When to use Spark RDDs?

Use RDDs when you need to compute over a dataset with Spark Actions such as count or reduce, for example to answer questions such as “how many times did xyz happen?” or “how many times did xyz happen by location?”
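Both questions can be sketched in plain Python on a small, made-up event log.  The event names and locations below are hypothetical, and this is a single-machine model rather than Spark code.  In PySpark the same shape would typically be a filter followed by count for the first question, and a map to (location, 1) pairs followed by reduceByKey for the second.

```python
from collections import Counter
from functools import reduce

# Hypothetical event log: (event_name, location) pairs.
events = [
    ("login", "Austin"),
    ("purchase", "Austin"),
    ("login", "Boston"),
    ("login", "Austin"),
    ("purchase", "Boston"),
]

# "How many times did login happen?" -> keep matching events, then count them.
logins = [e for e in events if e[0] == "login"]
login_count = len(logins)

# The same count expressed as a reduce that sums a 1 for every matching event.
login_count_via_reduce = reduce(lambda a, b: a + b, (1 for _ in logins), 0)

# "How many times did login happen by location?" -> count per key.
logins_by_location = dict(Counter(location for _, location in logins))

print(login_count, login_count_via_reduce, logins_by_location)
```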

Often, RDDs are transformed into new RDDs in order to better prepare datasets for processing further down the pipeline.  To reuse a previous example, let’s say you want to examine North America customer data and you have an RDD of all worldwide customers in memory.  It could be beneficial from a performance perspective to create a new RDD containing only North America customers instead of using the much larger RDD of all worldwide customers.

Depending on the Spark operating environment and RDD size, an RDD should be cached (via the cache function) or persisted to disk (via the persist function) when it is expected to be used more than once.
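Why caching helps can be shown with a toy model in plain Python.  ToyDataset and expensive_load are hypothetical names invented for this sketch; only the method name cache is borrowed from the real RDD API.  The model counts how often the underlying data is recomputed with and without caching.

```python
class ToyDataset:
    """Hypothetical model of a dataset that is recomputed unless cached."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function producing the data
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Mark the dataset for caching; data is stored on first use.
        self._use_cache = True
        return self

    def collect(self):
        if not self._use_cache:
            return list(self._compute())      # recompute every time
        if self._cached is None:
            self._cached = list(self._compute())
        return list(self._cached)

    def count(self):
        return len(self.collect())


calls = {"n": 0}


def expensive_load():
    """Stand-in for a slow read or a long chain of transformations."""
    calls["n"] += 1
    return [1, 2, 3, 4]


uncached = ToyDataset(expensive_load)
uncached.count()
uncached.collect()
uncached_calls = calls["n"]   # the data was computed once per action

calls["n"] = 0
cached = ToyDataset(expensive_load).cache()
cached.count()
cached.collect()
cached_calls = calls["n"]     # the data was computed only on first use

print(uncached_calls, cached_calls)
```

Without caching, every action pays the full cost of producing the data again; with caching, only the first one does.  The trade-off is the memory or disk space needed to hold the stored copy.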

 

Resources

Learning Spark book

Scala Transformation API examples

Python Transformation API examples

Hadoop Input Format API docs

 

Featured Image credit https://flic.kr/p/7TqgUV
