A guide to learning Apache Spark and Databricks Spark Certification

Sidharth Singh
3 min read · Jun 28, 2020
(Figure: Spark application execution in client mode)

TL;DR: Want to learn Apache Spark using Python by doing?

Step 1. Go to https://github.com/Realsid/databricks-spark-certification

Step 2. Clone the repository

Step 3. Create a Databricks Community Edition account

Step 4. Upload the .dbc file

Step 5. Create a cluster

Step 6. Execute the code one cell at a time!

Hey all! In this post I will walk you through how you can learn Apache Spark using Python and become a Databricks Certified Spark Developer.

What’s Apache Spark?

Apache Spark is a unified computing engine and set of libraries for parallel data processing on computer clusters. This definition has three components, broken down below:

  1. Unified: Spark handles tasks ranging from simple data loading to machine learning and streaming data processing, all at scale.
  2. Computing engine: Spark is not a storage system like Hadoop; its features are geared toward transforming copious amounts of data at high speed. It integrates with many persistent storage systems such as Apache Hadoop, Amazon S3, and Azure Data Lake.
  3. Libraries: Spark includes libraries for SQL and structured data, machine learning, stream processing, and graph analytics.

To begin with, you can think of it as SQL or pandas (if you are familiar with Python) for very, very large datasets! (This is an understatement of huge proportions, but it will do for now.)
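To make that analogy concrete, here is a minimal sketch of the PySpark DataFrame API: the same aggregation written once in a pandas-like style and once in SQL. The data and column names are made up for illustration; on Databricks a SparkSession named spark already exists, but the builder call below keeps the snippet runnable locally too.

```python
# A tiny, self-contained PySpark example. Data and names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks, `spark` is pre-created; this builder call is for local runs
spark = SparkSession.builder.appName("intro-sketch").getOrCreate()

# A distributed DataFrame -- conceptually like a pandas DataFrame at scale
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# pandas-flavoured: filter, then average
df.filter(F.col("age") > 30).agg(F.avg("age")).show()

# SQL-flavoured: the exact same query against a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT AVG(age) FROM people WHERE age > 30").show()
```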

What’s Databricks?

Databricks is the company founded by the original creators of Apache Spark. It provides a cloud-based environment for running Spark applications while taking care of the infrastructure setup. Databricks also offers certifications for developers to prove their credentials. To help with that, I have built a guide covering all the major topics you need to study to clear the certification.

OK, you have my attention. How do I get started?

For this project we will be using a Databricks Community Edition account to execute the notebooks, so make sure you have one. To get started:

  1. Go to GitHub and clone the following repository: https://github.com/Realsid/databricks-spark-certification
  2. Log into Databricks Community Edition and click on Import
  3. Click on File and browse
  4. Upload the dbc/databricks-spark-certification.dbc file from the cloned repository
  5. Start a cluster
  6. Attach the notebook to the cluster and execute the cells one at a time (do read all the comments and notes); a quick sanity-check cell is sketched below
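Once the notebook is attached, a quick first cell like the one below (my own sanity check, not part of the repository) confirms the cluster is alive before you dive in:

```python
# Sanity-check cell for a freshly attached Databricks notebook.
# `spark` (the SparkSession) is pre-created on Databricks; no imports needed.
print(spark.version)   # the Spark version the cluster is running

# A trivial distributed job: a one-column DataFrame of the numbers 0..4
spark.range(5).show()
```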

What is included?

The project contains Databricks notebooks on the following topics:

  1. Spark Architecture Components
  2. Spark Execution
  3. DataFrames API: SparkContext
  4. DataFrames API: SparkSession
  5. DataFrames API: DataFrameReader
  6. DataFrames API: DataFrameWriter
  7. DataFrames API: DataFrame Part I
  8. DataFrames API: DataFrame Part II
  9. DataFrames API: DataFrame Part III
  10. Spark SQL Functions Part I
  11. Spark SQL Functions Part II

On top of this, I have incorporated exercises for you to practice on, along with their solutions.
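To give you a taste before you open the notebooks, the sketch below exercises the DataFrameReader, a couple of Spark SQL functions, and the DataFrameWriter (topics 5, 6, 10, and 11 above). The paths and the name column are placeholders of my own, not files shipped with the repository.

```python
# A hedged sketch of the reader/writer and SQL-function APIs covered by
# the notebooks. Assumes the pre-created `spark` session on Databricks.
from pyspark.sql import functions as F

# DataFrameReader: read a CSV, treating row 1 as a header and inferring types
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/input.csv"))        # placeholder path

# Spark SQL functions: upper-case a (hypothetical) `name` column
df.select(F.upper(F.col("name")).alias("name_upper")).show(5)

# DataFrameWriter: write the result back out as Parquet
df.write.mode("overwrite").parquet("/path/to/output")   # placeholder path
```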

Happy Learning :)

EDIT: I recently came across a practice test on Udemy that I found extremely helpful. It closely resembles the certification exam questions and should give you the confidence to attempt the real one.

