Intro to Apache Spark, Installation & Running a test project

Basic Intro and Installation on Mac

1. Intro to Spark

2. Spark Installation on Mac

2.1 Spark installation using Brew

2.2 Using Apache Spark binaries

1. Intro to Spark

Apache Spark's own one-line definition is “A fast and general engine for large-scale data processing”.

That’s actually a good summary of what it’s all about. Spark can process massive amounts of data of almost any kind, such as weblogs or live data.

A high-Level overview of Data processing

Read===>Process===>Write

Spark can distribute data across a cluster of computers and process it there. Remember that it is only an execution engine; it has no storage of its own.

Spark can collect data from various sources, in real time or in batches, and it can also write the results to different storage platforms.

Spark runs on top of a cluster manager, which helps it scale and distribute the work between master nodes and worker nodes. The cluster manager can be Spark's own Standalone manager, Mesos, or YARN.

The cluster manager allocates resources across the cluster, while the driver program coordinates the work and distributes it to the executors.

Spark is scalable and fault tolerant, so if one of the executors goes down, it can recover and finish the job without starting all over again.

Data Transformation & Actions

Spark Core handles the data processing part: we create DataFrames or RDDs and perform transformations and actions on them.

Why are Transformations & Actions required?

To turn our data into something meaningful for the business use case. Raw data usually contains many unwanted things, so we need to modify it. Transformations describe those modifications (for example, filtering rows or selecting columns), and actions trigger the computation and return or write the result.

How do we perform Data Transformation?

We create a DataFrame/RDD from our data (which can live anywhere and be in any format) and apply the modifications we need.

Spark Job Stages and Tasks

Whatever code we write in Spark for our data processing, Spark internally translates the program into jobs, stages, and tasks.

Spark also builds a DAG (directed acyclic graph) of the operations, which it uses to create an optimized execution plan.

DataFrames are immutable: once one is created, it cannot be changed, and every transformation returns a new DataFrame. Each action we perform on a DataFrame is converted into one or more jobs.

Jobs are further divided into stages, and stages into tasks.

Program==>Jobs==>Stages==>Tasks

You can inspect these jobs, stages, and tasks in the Spark Web UI, which is served by default at http://localhost:4040 while an application is running.

Figure: Spark Web UI (blank because no processes are running at the moment).

You can read more about the Spark architecture and its other components on the official site: https://spark.apache.org/docs/latest/cluster-overview.html

Spark also provides other components/libraries such as Spark Streaming, Spark SQL, MLlib, and GraphX.

2. Spark Installation on Mac

The easiest way to install Spark on your Mac is with Homebrew. It only takes a few simple commands.

2.1 Spark installation using Brew

2.1.1 Install Homebrew

Run the following command from your terminal and it will install Homebrew for you.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

2.1.2 Install apache-spark

brew install apache-spark

2.1.3 Test the installation

If you have not changed the defaults, Spark is installed under /usr/local/Cellar/apache-spark/3.0.1 (the last part of the path depends on the installed version).

If the installation succeeded, running pyspark from your terminal opens the PySpark shell.

Here we test the installation with the README.md file by counting the number of lines in it.
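The same check can also be run as a small standalone test project instead of being typed into the interactive shell. The sketch below is only an illustration. The file name count_readme.py, the WORKDIR, SCRIPT and README_PATH variables, and the default README location (derived from the Homebrew path mentioned above) are all assumptions, so adjust them to your setup. It writes a short PySpark script that reads the file, applies one transformation (filter) and two actions (count), and submits it only if spark-submit is on the PATH.

```shell
# Assumed locations; change them to match your machine.
WORKDIR="$(mktemp -d)"
SCRIPT="$WORKDIR/count_readme.py"
README_PATH="${README_PATH:-/usr/local/Cellar/apache-spark/3.0.1/README.md}"

# Write a minimal PySpark job: read a text file and count its lines.
cat > "$SCRIPT" <<'EOF'
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadmeLineCount").getOrCreate()

# Each line of the file becomes one row in a single "value" column.
lines = spark.read.text(sys.argv[1])

# Transformation: lazy, nothing is computed yet.
spark_lines = lines.filter(lines.value.contains("Spark"))

# Actions: each count() triggers a job.
print("total lines:", lines.count())
print("lines mentioning Spark:", spark_lines.count())

spark.stop()
EOF

# Submit the job only when Spark is installed and the input file exists.
if command -v spark-submit >/dev/null 2>&1 && [ -f "$README_PATH" ]; then
    spark-submit "$SCRIPT" "$README_PATH"
else
    echo "spark-submit or $README_PATH not found; script saved at $SCRIPT"
fi
```

While the job runs, it should show up in the Spark Web UI together with its stages and tasks.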

2.2 Using Apache Spark binaries

2.2.1 Install Java 11 & Python3

Spark needs a Java runtime, and this guide uses Java 11 with the latest Spark version. You can install it with brew.

# brew install java11

After installation, check the installed version and set the JAVA_HOME environment variable.

# java -version (prints the installed Java version)

# export JAVA_HOME=/usr/local/Cellar/openjdk@11/11.0.9

To set JAVA_HOME permanently, add the export line above to your shell startup script, in this case your .zshrc file.

# brew install python3

This command installs Python 3 for you. Afterwards you can check the version with python3 --version.

# export PYSPARK_PYTHON=python3

Add this line to your .zshrc file as well, the same way as JAVA_HOME above.

2.2.2 Download & Install Apache Spark

Download Apache Spark from the official website (https://spark.apache.org/downloads.html). The download is a compressed .tgz archive that you need to untar. I created a separate folder named spark3 and extracted the archive there.
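As a rough sketch of the extraction step: the archive name spark-3.0.1-bin-hadoop2.7.tgz, its location in ~/Downloads, and the ~/spark3 target folder are assumptions (use the file you actually downloaded). They are held in two hypothetical variables, SPARK_TGZ and SPARK_DIR.

```shell
# Assumed names; replace with the archive you downloaded and the folder you prefer.
SPARK_TGZ="${SPARK_TGZ:-$HOME/Downloads/spark-3.0.1-bin-hadoop2.7.tgz}"
SPARK_DIR="${SPARK_DIR:-$HOME/spark3}"

# Create the target folder, then extract the archive into it.
mkdir -p "$SPARK_DIR"
if [ -f "$SPARK_TGZ" ]; then
    tar -xzf "$SPARK_TGZ" -C "$SPARK_DIR"
    ls "$SPARK_DIR"
else
    echo "Archive not found at $SPARK_TGZ; download it first."
fi
```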

Export SPARK_HOME and add $SPARK_HOME/bin to your PATH in your .zshrc file.

If you have followed along so far, your .zshrc file should now contain the JAVA_HOME, PYSPARK_PYTHON, SPARK_HOME, and PATH entries described above.

Run pyspark or spark-shell to check your installation. If everything was done as described above, you will get the PySpark command line running Python 3.
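For reference, a minimal sketch of those .zshrc entries might look like the following. The JAVA_HOME value is the one used earlier in this guide. The SPARK_HOME path is an assumption based on extracting a hypothetical spark-3.0.1-bin-hadoop2.7 archive into ~/spark3, so substitute your own folder.

```shell
# Java installed through Homebrew (path from the Java step above).
export JAVA_HOME=/usr/local/Cellar/openjdk@11/11.0.9

# Make PySpark use Python 3.
export PYSPARK_PYTHON=python3

# Assumed Spark location; point this at the folder you extracted.
export SPARK_HOME="$HOME/spark3/spark-3.0.1-bin-hadoop2.7"

# Put the Spark launchers (pyspark, spark-shell, spark-submit) on the PATH.
export PATH="$SPARK_HOME/bin:$PATH"
```

After saving the file, run source ~/.zshrc or open a new terminal so that the changes take effect.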
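Before launching the shells, a quick way to confirm that every tool from the steps above is reachable is a small check like the one below. It is only a sketch: check_cmd is a hypothetical helper name, and the list of commands simply mirrors the tools installed in this guide.

```shell
# Report whether a command is available on the PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Check the tools this guide relies on and count the ones that are missing.
missing=0
for cmd in java python3 pyspark spark-shell spark-submit; do
    check_cmd "$cmd" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"
```

If anything is reported as missing, recheck the matching export line in your .zshrc file and open a new terminal.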

Data Engineer | Vattenfall | Sweden | LTI