In this tutorial, we will cover using Spark SQL with a mySQL database input data source.
Let’s show examples of using Spark SQL mySQL. We’re going to use mySQL in this tutorial, but you can apply the concepts presented here to any relational database which has a JDBC driver.
By the way, If you are not familiar with Spark SQL, there are a few Spark SQL tutorials on this site.
1. MySQL instance
2. MySQL JDBC driver (download available https://dev.mysql.com/downloads/connector/j/)
3. Previously used baby_names.csv file as source data.
The Spark SQL with MySQL JDBC example assumes a mysql db named “sparksql” with table called “baby_names”. The “baby_names” table has been populated with the baby_names.csv data used in previous Spark tutorials.
Here’s a screencast on YouTube of how I setup my environment:
The SQL to create the baby_names table:
DROP TABLE IF EXISTS `baby_names`;
CREATE TABLE `baby_names` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`year` int(11) DEFAULT NULL,
`first_name` varchar(100) DEFAULT NULL,
`county` varchar(100) DEFAULT NULL,
`sex` varchar(5) DEFAULT NULL,
`count` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
If you have any questions about the environment setup, leave comments on this post.
We need to pass in the mySQL JDBC driver jar when we start up the Spark Shell. (In a Spark application, any third party libs such as a JDBC driver would be included in package.)
From shell we’re going to establish a connection to the mySQL db and then run some queries via Spark SQL.
Spark SQL with MySQL (JDBC) Example Tutorial
1. Start the spark shell with –jars argument
$SPARK_HOME/bin/spark–shell –jars mysql-connector-java-5.1.26.jar
This example assumes the mySQL connector JDBC jar file is located in the same directory as where you are calling spark-shell. If it is not, you can specify the path location such as:
$SPARK_HOME/bin/spark–shell –jars /home/example/jars/mysql-connector-java-5.1.26.jar
2. Once the shell is running, let’s establish connection to mySQL db and read the baby_names table:
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.5.2
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
SQL context available as sqlContext.
scala> val dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:mysql://localhost/sparksql").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "baby_names").option("user", "root").option("password", "root").load()
Change the mySQL url and user/password values in the above code appropriate to your environment.
3. Let’s confirm the dataframe by show the schema of the table
3. Register the data as a temp table for future SQL queries
4. We are now in position to run some SQL such as
scala> dataframe_mysql.sqlContext.sql("select * from names").collect.foreach(println)
Conclusion Spark SQL with MySQL (JDBC)
This example was designed to get you up and running with Spark SQL and mySQL or any JDBC compliant database. What other examples would you like to see with Spark SQL and JDBC? Please leave ideas or questions in comments below.
Featured Image credit: https://flic.kr/p/f8KB7L