1. Intro to Spark
Apache Spark's own one-line definition is "a fast and general engine for large-scale data processing".
That's actually a good summary of what it's all about. Spark can process massive amounts of data from almost any source: web logs, live event streams, or anything else.
A High-Level Overview of Data Processing
Spark distributes data across a cluster of computers and processes it there. Remember, Spark is only an execution engine; it has no storage of its own.
Spark can ingest data from various sources, in either real-time (streaming) or batch mode, and can also write results out to different storage platforms.
Spark runs on top of a cluster manager, which helps it scale and distribute work between the master node and worker nodes. The cluster manager can be Spark's Standalone manager, Apache Mesos, or Hadoop YARN.
The cluster manager allocates resources and distributes work among the executors, in coordination with the Driver Program.
Spark is scalable and fault tolerant: if one of the executors goes down, Spark can recover and finish the job without starting it all over again.
Data Transformation & Actions
Spark Core handles the data-processing part: we create DataFrames or RDDs and perform transformations and actions on them.
Why are Transformations & Actions required?
To shape our data into something meaningful for the business use case. Raw data usually contains plenty of unwanted content, so we need to modify it; in Spark these modifications are expressed as transformations and actions.
How do we perform Data Transformations?
We create a DataFrame or RDD from our data (which can live anywhere, in any format) and apply the modifications we need.
Spark Job Stages and Tasks
Whatever data-processing code we write, Spark internally translates the program into jobs, stages, and tasks.
Spark also builds a DAG (directed acyclic graph) of the operations and optimizes it into an execution plan.
DataFrames are immutable; we can't change one once it is created. So every action we perform on a DataFrame is turned into a job.
Each job is further divided into stages, and each stage into tasks.
You can inspect all these jobs, stages, and tasks from the Spark Web UI.
You can read more about Spark's architecture and other components on the official site: https://spark.apache.org/docs/latest/cluster-overview.html
Spark also provides other components/libraries like Spark Streaming, Spark SQL, MLlib, and GraphX.
2. Spark Installation on Mac
The easiest way to install Spark on your Mac is with Homebrew; you can get it running with a few simple commands.
2.1 Spark Installation Using Homebrew
2.1.1 Install Homebrew
Execute the command below from your terminal and it will install Homebrew for you.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
2.1.2 Install apache-spark
brew install apache-spark
2.1.3 Test the Installation
If you have not modified any defaults, it will be installed at /usr/local/Cellar/apache-spark/3.0.1
If the installation went through properly, typing pyspark will bring up a screen like the one below.
Here we test the installation with the README.md file by counting its number of lines.
2.2 Using Apache Spark binaries
2.2.1 Install Java 11 & Python3
Java 11 is required if you want to use the latest Spark version; you can install it with brew.
brew install java11
After installation, you need to set the JAVA_HOME environment variable.
java -version  # shows the installed Java version
To set JAVA_HOME permanently, add it to your startup script, i.e. your .zshrc file.
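A sketch of the JAVA_HOME lines for ~/.zshrc; the exact install path depends on your machine, so this uses macOS's java_home helper to resolve it:

```shell
# Resolve the Java 11 install location and export it.
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"
export PATH="$JAVA_HOME/bin:$PATH"
```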
brew install python3
This command installs Python for you; you can check the version afterwards with python3 --version.
Add this to your .zshrc file, same as above.
2.2.2 Download & Install Apache Spark
Download Apache Spark from the official website. Once downloaded, you will get a tarball that you need to untar like below.
Untar the downloaded file; I created a separate folder, spark3, and extracted it there.
Export SPARK_HOME and add $SPARK_HOME/bin to your PATH in your .zshrc file.
If you have followed along until here, your .zshrc file should look like this:
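A sketch of the final ~/.zshrc additions; the spark3 folder comes from the step above, and the extracted directory name is an assumption, so adjust it to whatever your tarball actually extracted to:

```shell
# Java 11 (resolved via macOS's java_home helper)
export JAVA_HOME="$(/usr/libexec/java_home -v 11)"
export PATH="$JAVA_HOME/bin:$PATH"

# Spark binaries (directory name is illustrative -- match your download)
export SPARK_HOME="$HOME/spark3/spark-3.0.1-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
```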
Run pyspark or spark-shell to check your installation; if everything was done as above, you will get the PySpark prompt running on Python 3.
Refer to the official Spark documentation for detailed info.
Another small project can be found here, where I used DataFrames and SQL operations to get the result.