This guide will help you install Apache Spark on your Mac, set it up with Python (PySpark), and verify that everything works. No prior experience is needed; just follow along step by step.
Installing Spark on your machine allows you to learn, experiment, and run distributed code even on a single system (using local mode). It's great for beginners because you can write and test Spark jobs without needing a cluster.
Homebrew is a package manager for macOS. It simplifies installing software such as Java, Python, and Spark. If you don't already have it, install it with:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
What is Homebrew and why do we use it?
Homebrew is like an app store for your terminal. It makes installing developer tools on Mac much easier and safer.
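Once the installer finishes, you can confirm Homebrew is available (the version number will vary):
brew --version
Homebrew 4.x.x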
Apache Spark runs on the Java Virtual Machine (JVM). Let’s install Java 11 using Homebrew:
brew install openjdk@11
After installing, link it to your environment. The paths below assume an Apple Silicon Mac, where Homebrew lives under /opt/homebrew; on an Intel Mac, replace /opt/homebrew with /usr/local:
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
Verify the Java version:
java -version
openjdk version "11.0.x" ...
Now we install Apache Spark using Homebrew:
brew install apache-spark
Add Spark to your environment variables:
echo 'export SPARK_HOME="/opt/homebrew/opt/apache-spark/libexec"' >> ~/.zprofile
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
Verify Spark installation:
spark-shell
Using Scala version ... Welcome to Spark!
What is spark-shell and why does it use Scala?
spark-shell is the interactive shell for Spark, written in Scala. Spark itself was originally written in Scala, so this shell ships as part of the core distribution. (Type :quit to exit it.)
PySpark allows you to use Spark with Python. First, make sure you have Python 3 installed (use Homebrew if needed).
brew install python
pip3 install pyspark
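To confirm the package installed correctly, you can print its version from Python (the output will match whatever version pip installed):
python3 -c "import pyspark; print(pyspark.__version__)"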
Now, launch PySpark:
pyspark
Python 3.x ... Using Spark version ... Welcome to the PySpark shell!
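Before writing a full program, a one-liner is enough to confirm the shell works. The PySpark shell creates a SparkContext for you, available as sc:
# `sc` is pre-created in the pyspark shell; this sums the numbers 1..100.
sc.parallelize(range(1, 101)).sum()
# returns 5050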
Let’s write a simple PySpark program to count the number of words in a sentence.
from pyspark import SparkContext

# Create a SparkContext running in local mode.
sc = SparkContext("local", "WordCount")

# Distribute a small list of sentences as an RDD.
data = ["Hello world", "Apache Spark is powerful", "Big Data is growing fast"]
rdd = sc.parallelize(data)

# Split each sentence into words, pair each word with 1,
# then sum the counts per word.
words = rdd.flatMap(lambda line: line.split(" "))
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Bring the results back to the driver and print them.
for word, count in word_counts.collect():
    print(f"{word}: {count}")

sc.stop()
Hello: 1
world: 1
Apache: 1
Spark: 1
is: 2
powerful: 1
Big: 1
Data: 1
growing: 1
fast: 1
(The order of the lines may vary, since the counts come back from parallel tasks.)
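You can also save this program to a file and run it outside the interactive shell with spark-submit, which ships with Spark. The filename here is just an example:
spark-submit word_count.py   # word_count.py is whatever name you saved the script under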
Why do we use parallelize and not just a normal list?
sc.parallelize() distributes data across Spark's workers for parallel processing. A normal Python list lives only in the driver process and can't be distributed efficiently.
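To make the distribution visible, parallelize takes an optional numSlices argument controlling how many partitions the data is split into. A small sketch you can run in the pyspark shell:
# Ask Spark to split this tiny dataset into 4 partitions.
rdd = sc.parallelize(["a", "b", "c", "d"], numSlices=4)
print(rdd.getNumPartitions())  # prints 4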
With both spark-shell and pyspark working, you are now ready to start building Spark applications locally on your Mac. The next step is learning core Spark concepts like RDDs and DataFrames.