Spark Read JDBC Examples with MySQL


In this Spark Read JDBC tutorial, we will cover using Spark SQL with a MySQL database.

Spark’s read JDBC methods allow us to read data and create DataFrames from any relational database that supports JDBC connectivity.

It is useful for a variety of reasons including leveraging Spark’s distributed computing capabilities for processing data stored in a traditional database.

Using read.format("jdbc") as shown in the Spark read JDBC examples below, we can specify a JDBC connection URL, username, and password, as well as any other JDBC properties required to connect to the database.

Once the data is loaded into a DataFrame, we can then use Spark’s SQL and DataFrame APIs to perform various transformations and analyses on the data. We will go through examples of this below.

Overall, Spark’s JDBC read methods provide a fairly convenient way to access and process data stored in a traditional database using Spark.

** Updated May 2023. This tutorial was originally published in 2015 **

Note: These Spark read examples take a Scala-centric approach. If you are more interested in Python, check out the pyspark jdbc tutorial.


Overview

In this Spark read JDBC tutorial, we’ll explore code examples of using Spark SQL with MySQL by connecting over JDBC.

We’re going to use MySQL with Spark in this tutorial, but you can apply the concepts presented here to any relational database that has a JDBC driver.

spark.read.jdbc() is a method in Spark’s DataFrameReader API to read data from a JDBC data source and create a DataFrame. The read.jdbc() method takes a JDBC connection URL, a table or query, and a set of optional parameters to specify how to connect to the database. We’ll go through specific examples below.
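
For orientation, here is a minimal sketch of the read.jdbc() form, assuming a SparkSession named spark is already in scope (as it is in spark-shell) and using the same illustrative connection values as the examples later in this post:

import java.util.Properties

// Connection properties; the user, password, and driver values are placeholders for your setup
val connProps = new Properties()
connProps.put("user", "root")
connProps.put("password", "example")
connProps.put("driver", "com.mysql.cj.jdbc.Driver")

// The URL and table name mirror the MySQL examples used later in this tutorial
val df = spark.read.jdbc("jdbc:mysql://localhost/supergloo", "baby_names", connProps)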

By the way, if you are not familiar with Spark SQL, there are a few Spark SQL tutorials on this site.

Spark Read JDBC Examples Requirements

1. Docker and Docker Compose

I will be using Docker Compose in these examples in case you’d like to follow along. There is a screencast with some really catchy intro music and effects below. I use the mysql.yml compose file, which is available from the GitHub repo in the Resources section below.

docker-compose -f mysql.yml up -d

Even if you don’t like catchy music, the screencast below shows setting up the MySQL environment used in these examples.

2. MySQL JDBC driver

Download available at https://dev.mysql.com/downloads/connector/j/

3. Apache Spark

In the first example, the JDBC driver downloaded in step 2 needs to be on the Spark classpath. I’ll explain more below.

4. Sequel Ace [Optional]

Sequel Ace is used to import the CSV file used in the Spark read JDBC examples. There is more than one way to import a CSV file into MySQL, as you probably already know, so I’m going to leave this step as optional because you may choose to import it a different way.

We’re going to use the baby_names.csv file as source data for the new table. This file is available from the GitHub repo mentioned below.
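
If you’d rather skip Sequel Ace entirely, one alternative is to let Spark do the import: read the CSV into a DataFrame and write it to MySQL over JDBC. This is only a sketch; it assumes the JDBC driver is already on the classpath (covered below) and reuses the same illustrative connection values as the rest of this tutorial:

// Read the CSV with a header row and let Spark infer column types
val names = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("baby_names.csv")

// Write it to MySQL over JDBC; the connection values are placeholders for your setup
names.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost/supergloo")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("dbtable", "baby_names")
  .option("user", "root")
  .option("password", "example")
  .mode("append")
  .save()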

MySQL Setup for Spark JDBC Example Overview

I’ll list all the steps to repeat the demo, but let’s cover the high-level overview first.

In this first Spark read example, I’m going to run MySQL from a Docker container using docker-compose. The Docker Compose file used is available from the supergloo GitHub repo (link in the Resources section below).

Next, this Spark JDBC with MySQL tutorial assumes a MySQL database named “supergloo” with a table called “baby_names” has been created. I’ll walk through how I created it in case it’s helpful to you, but again, there is more than one way to do it.

Last, the “baby_names” table has been populated with the baby_names.csv data used in previous Spark tutorials. This CSV file is also available from the GitHub repo in the Resources section below.

Here’s a screencast on YouTube of how I set up my environment:

In case you’re wondering, here’s the SQL to create the baby_names table:

DROP TABLE IF EXISTS `baby_names`;

CREATE TABLE `baby_names` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `year` int(11) DEFAULT NULL,
  `first_name` varchar(100) DEFAULT NULL,
  `county` varchar(100) DEFAULT NULL,
  `sex` varchar(5) DEFAULT NULL,
  `count` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

If you have any questions about the environment setup, leave comments on this post.

Spark JDBC Overview

When reading from or writing to a JDBC database from Apache Spark, the JDBC driver needs to be on the classpath. This JDBC jar file may be added to the classpath in a few different ways. I will show one way in the examples below, but you should know there is more than one way to do it.
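
For example, instead of copying the driver jar into Spark’s jars/ directory (the approach used in Part 1 below), you can pass the jar when launching the shell; adjust the path and file name to the driver version you downloaded:

bin/spark-shell --jars /path/to/mysql-connector-j-8.0.33.jar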

Spark Read JDBC from Scala Example

Spark Read JDBC with MySQL Tutorial, Part 1

In this first example, I will show using spark-shell to connect to the previously created MySQL database. In this code example, we’ll connect, perform a JDBC-based read to create a DataFrame, register it as a temp table, and then query it with SQL.

Assuming you downloaded the requirements mentioned above and have your MySQL database container running with the “supergloo” database and “baby_names” table created and loaded, here are the steps to read from JDBC in Spark and query with SQL.

1. Add JDBC Driver to Spark Classpath

In this first example, I’m going to run the code examples from spark-shell. My SPARK_HOME environment variable is set to ~/dev/spark-3.4.0-bin-hadoop3

I downloaded and extracted the mySQL JDBC Driver to ~/dev/mysql-connector-j-8.0.33

From my SPARK_HOME directory:

cp ../mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar jars/

You will have to update your path and JAR file name accordingly. This is just how I ran it from my environment.

What did we just do? By copying the driver jar file into the jars/ directory, the driver needed to read from the database becomes available on the spark-shell classpath. We’re ready to move to the next step.

2. Spark Read JDBC

Once the Spark shell is running, let’s establish a connection to the MySQL database and read the baby_names table:

$ bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/05 08:42:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.1.16:4040
Spark context available as 'sc' (master = local[*], app id = local-1683294155525).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 16.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val bn = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/supergloo").option("driver", "com.mysql.cj.jdbc.Driver").option("dbtable", "baby_names").option("user", "root").option("password", "example").load()
bn: org.apache.spark.sql.DataFrame = [Year: int, First Name: string ... 3 more fields]

Change the MySQL URL and user/password values in the above code as appropriate for your environment.

As you can hopefully see, we are creating a new DataFrame using spark.read.format("jdbc") with multiple option settings. The output above indicates the Spark read JDBC operation was successful, but let’s confirm in the next step.
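
As a side note, the same read can also be expressed with an options Map, which can be easier to maintain. The values below simply mirror the spark-shell example above; additional options such as fetchsize or the partitioning options (partitionColumn, lowerBound, upperBound, numPartitions) can be added to the same Map if needed:

// Same connection settings as above, collected into a Map
val jdbcOptions = Map(
  "url" -> "jdbc:mysql://localhost/supergloo",
  "driver" -> "com.mysql.cj.jdbc.Driver",
  "dbtable" -> "baby_names",
  "user" -> "root",
  "password" -> "example"
)

val bnAlt = spark.read.format("jdbc").options(jdbcOptions).load()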

3. Confirm Spark Read JDBC DataFrame

Let’s confirm the DataFrame by showing a sample of the table’s contents:

scala> bn.show
+----+----------+-----------+---+-----+
|Year|First Name|     County|Sex|Count|
+----+----------+-----------+---+-----+
|2007|      ZOEY|      KINGS|  F|   11|
|2007|      ZOEY|    SUFFOLK|  F|    6|
|2007|      ZOEY|     MONROE|  F|    6|
|2007|      ZOEY|       ERIE|  F|    9|
|2007|       ZOE|     ULSTER|  F|    5|
|2007|       ZOE|WESTCHESTER|  F|   24|
|2007|       ZOE|      BRONX|  F|   13|
|2007|       ZOE|   NEW YORK|  F|   55|
|2007|       ZOE|     NASSAU|  F|   15|
|2007|       ZOE|       ERIE|  F|    6|
|2007|       ZOE|    SUFFOLK|  F|   14|
|2007|       ZOE|      KINGS|  F|   34|
|2007|       ZOE|     MONROE|  F|    9|
|2007|       ZOE|     QUEENS|  F|   26|
|2007|       ZOE|     ALBANY|  F|    5|
|2007|     ZISSY|   ROCKLAND|  F|    5|
|2007|     ZISSY|      KINGS|  F|   27|
|2007|      ZION|      KINGS|  M|   15|
|2007|      ZION|      BRONX|  M|   14|
|2007|       ZEV|   ROCKLAND|  M|    6|
+----+----------+-----------+---+-----+
only showing top 20 rows

4. Register Temp Table

Register the DataFrame as a temp table (a temporary view) for future SQL queries:

scala> bn.createOrReplaceTempView("baby_names")

5. Run SQL Query with JDBC-based Data

At this point, we are ready to begin querying our JDBC-based data source. A simple SQL query such as the following can be run:

scala> spark.sql("select * from baby_names").show
+----+----------+-----------+---+-----+
|Year|First Name|     County|Sex|Count|
+----+----------+-----------+---+-----+
|2007|      ZOEY|      KINGS|  F|   11|
|2007|      ZOEY|    SUFFOLK|  F|    6|
|2007|      ZOEY|     MONROE|  F|    6|
|2007|      ZOEY|       ERIE|  F|    9|
|2007|       ZOE|     ULSTER|  F|    5|
|2007|       ZOE|WESTCHESTER|  F|   24|
|2007|       ZOE|      BRONX|  F|   13|
|2007|       ZOE|   NEW YORK|  F|   55|
|2007|       ZOE|     NASSAU|  F|   15|
|2007|       ZOE|       ERIE|  F|    6|
|2007|       ZOE|    SUFFOLK|  F|   14|
|2007|       ZOE|      KINGS|  F|   34|
|2007|       ZOE|     MONROE|  F|    9|
|2007|       ZOE|     QUEENS|  F|   26|
|2007|       ZOE|     ALBANY|  F|    5|
|2007|     ZISSY|   ROCKLAND|  F|    5|
|2007|     ZISSY|      KINGS|  F|   27|
|2007|      ZION|      KINGS|  M|   15|
|2007|      ZION|      BRONX|  M|   14|
|2007|       ZEV|   ROCKLAND|  M|    6|
+----+----------+-----------+---+-----+
only showing top 20 rows
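
Queries can of course go beyond a simple select. As a hypothetical example against the same temp table, the following aggregates the counts by name (note the backticks around column names that contain spaces):

spark.sql("""
  SELECT `First Name`, SUM(`Count`) AS total
  FROM baby_names
  GROUP BY `First Name`
  ORDER BY total DESC
""").show(10)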

Spark JDBC Write and Read Example Part 2

Now that we have a working example, let’s consider an example of Spark JDBC read and Spark JDBC write from a deployed application.

Rather than using the spark-shell, let’s use a Scala object called SparkSQLJDBCApp. The source code and build environment (pretty much everything you would need) are in the GitHub repo linked below.

In this example, the code is similar to the above, but we now need to obtain a SparkSession manually and set spark.master accordingly.

This example is also more complete because it writes back to a JDBC source, as seen in the following code:

    pop.write.format("jdbc").options(jdbcWriteOptions).saveAsTable("popular_baby_names")
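
For orientation, here is a condensed sketch of what such a standalone object might look like. The real SparkSQLJDBCApp lives in the GitHub repo and may differ; the transformation and connection values below are illustrative, and the write here uses the dbtable option with save() rather than the saveAsTable call shown above:

import org.apache.spark.sql.SparkSession

object SparkSQLJDBCApp extends App {

  // Build the session manually since we are outside spark-shell; set master appropriately for your cluster
  val spark = SparkSession.builder()
    .appName("SparkSQLJDBCApp")
    .master("local[*]")
    .getOrCreate()

  // Shared JDBC connection options; values are placeholders for your environment
  val jdbcOptions = Map(
    "url" -> "jdbc:mysql://localhost/supergloo",
    "driver" -> "com.mysql.cj.jdbc.Driver",
    "user" -> "root",
    "password" -> "example"
  )

  // Read the source table over JDBC
  val bn = spark.read.format("jdbc")
    .options(jdbcOptions + ("dbtable" -> "baby_names"))
    .load()

  // Illustrative transformation: total counts per first name
  bn.createOrReplaceTempView("baby_names")
  val pop = spark.sql(
    "SELECT `First Name`, SUM(`Count`) AS total FROM baby_names " +
    "GROUP BY `First Name` ORDER BY total DESC")

  // Write the result back over JDBC
  pop.write.format("jdbc")
    .options(jdbcOptions + ("dbtable" -> "popular_baby_names"))
    .mode("overwrite")
    .save()

  spark.stop()
}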

Here’s a video of me running it in IntelliJ and showing the results in MySQL.

As you can see in the video, I run the example from IntelliJ, but the source code is all set for you to assemble into a JAR file and deploy to a Spark cluster.

You can find more on using spark-submit here.
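
As a rough illustration, submitting the assembled jar to a cluster might look something like the following; the class name, master URL, and file paths are placeholders for your own environment, and the MySQL driver jar is passed along with --jars:

bin/spark-submit \
  --class SparkSQLJDBCApp \
  --master spark://your-master-host:7077 \
  --jars /path/to/mysql-connector-j-8.0.33.jar \
  /path/to/your-assembled-app.jar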

Conclusion: Spark Read JDBC with MySQL

I created these Spark read JDBC and Spark write JDBC examples to get you up and running with Spark SQL and any JDBC-compliant database quickly. Also, it’s a reference for me because I forget many things.

What other examples would you like to see with Spark SQL and Spark JDBC?  Leave ideas or questions in comments below.

Spark JDBC Examples Resources

Before you go, for more Spark tutorials, check out Spark SQL with Scala.

