Spark S3 Integration: A Comprehensive Guide

Spark S3 Guide

Apache Spark is an open-source distributed computing system that provides fast, general-purpose cluster computing for big data processing. Amazon Simple Storage Service (S3) is a scalable cloud storage service originally designed for online backup and archiving of data and applications on Amazon Web Services (AWS), but it has evolved into the basis of object storage for … Read more

How to Use Spark Submit Command to Deploy

Spark Submit Command Tutorial

Running spark submit to deploy your application to an Apache Spark cluster is a required step towards Apache Spark proficiency.  As covered elsewhere on this site, spark submit command deploys can target a variety of orchestration components, such as a YARN-based Spark cluster running in Cloudera, Hortonworks or MapR, or even Kubernetes.  There … Read more

SparkSession, SparkContext, SQLContext in Spark [What’s the difference?]

How to choose between SparkContext, SQLContext and SparkSession

There have been some significant changes in the Apache Spark API over the years and when folks new to Spark begin reviewing source code examples, they will see references to SparkSession, SparkContext and SQLContext. Because this code looks so similar in design and purpose, users often ask questions such as “what’s the difference” and “why, … Read more

What is Apache Spark? An Essential Overview

What is Apache Spark?

Apache Spark is an open-source data processing engine designed for fast, large-scale data processing. Originally developed at the University of California, Berkeley, in 2009 as an alternative to the Hadoop MapReduce batch processing framework, Spark quickly became one of the most popular frameworks in big data analytics. Spark’s main advantage lies in its ability to … Read more

Spark RDD – A 2 Minute Guide for Beginners

Spark RDD

Spark RDD is short for Apache Spark Resilient Distributed Dataset.  RDDs are a foundational component of the Apache Spark large-scale data processing framework. What is a Spark RDD? A Spark RDD is an immutable, fault-tolerant, and possibly distributed collection of data elements.  RDDs … Read more

Spark FAIR Scheduler Example

Scheduling in Spark can be a confusing topic.  When someone says “scheduling” in Spark, do they mean scheduling applications running on the same cluster?  Or, do they mean the internal scheduling of Spark tasks within the Spark application?  So, before we cover an example of utilizing the Spark FAIR Scheduler, let’s make sure we’re on … Read more

Apache Spark Thrift Server Load Testing Example

Spark Thrift Server Stress Test Tutorial

Wondering how to perform stress tests with Apache Spark Thrift Server?  This tutorial will describe one way to do it. What is Apache Spark Thrift Server?   Apache Spark Thrift Server is based on Apache HiveServer2 and was created to allow JDBC/ODBC clients to execute SQL queries using a Spark Cluster.  From my … Read more

Spark Thrift Server with Cassandra Example

Spark Thrift Server with Cassandra

With the Spark Thrift Server, you can do more than you might have thought possible.  For example, want to use `joins` with Cassandra?  Or, help people familiar with SQL leverage your Spark infrastructure without having to learn Scala or Python?  They can use the SQL-based tools they already know, such as Tableau or … Read more

Spark Submit Command Line Arguments

Spark Command Line Arguments in Scala

The primary reason to use Spark submit command line arguments is to avoid hard-coding values into our code. As we know, hard-coding should be avoided because it makes our application less flexible. For example, let’s assume we want to run our Spark job in both test and production environments. … Read more