Installing Apache Spark on macOS
This guide will help you install Apache Spark on your Mac, set it up with Python (PySpark), and verify that everything works. No prior experience is needed. Just follow along step by step.
Why Install Spark Locally?
Installing Spark on your own machine lets you learn, experiment, and run distributed code on a single system using local mode, where the driver and executors all run inside one process. It's great for beginners because you can write and test Spark jobs without needing a cluster.
Step 1: Install Homebrew (if not already installed)
Homebrew is a package manager for macOS. It simplifies the installation of software such as Java, Python, and Spark.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
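If the install script completes without errors, you can confirm that Homebrew is available by checking its version:
brew --version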
Question:
What is Homebrew and why do we use it?
Answer:
Homebrew is like an app store for your terminal. It makes installing developer tools on Mac much easier and safer.
Step 2: Install Java (OpenJDK)
Apache Spark runs on the Java Virtual Machine (JVM). Let’s install Java 11 using Homebrew:
brew install openjdk@11
After installing, link the JDK so macOS can find it, and add it to your PATH. (The paths below assume an Apple Silicon Mac; on Intel Macs, Homebrew installs under /usr/local instead of /opt/homebrew, so adjust accordingly.)
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
Verify the Java version:
java -version
openjdk version "11.0.x" ...
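Some of Spark's launch scripts look for the JAVA_HOME environment variable rather than the java binary on your PATH. If you later see a "JAVA_HOME is not set" warning, you can set it using macOS's built-in java_home helper:
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zprofile
source ~/.zprofile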
Step 3: Install Apache Spark
Now we install Apache Spark using Homebrew:
brew install apache-spark
Add Spark to your environment variables:
echo 'export SPARK_HOME="/opt/homebrew/opt/apache-spark/libexec"' >> ~/.zprofile
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
Verify the Spark installation by launching the interactive shell:
spark-shell
Using Scala version ... Welcome to Spark!
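Type :quit (or press Ctrl+D) to leave the shell. If you'd rather confirm the installed version without opening an interactive shell, spark-submit can print it:
spark-submit --version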
Question:
What is spark-shell and why does it use Scala?
Answer:
spark-shell is the interactive shell for Spark, based on Scala. Spark was originally written in Scala, so this shell is part of its core distribution.
Step 4: Set Up PySpark
PySpark allows you to use Spark with Python. First, make sure you have Python 3 installed (use Homebrew if needed).
brew install python
pip3 install pyspark
Now launch PySpark. (The pip package bundles its own copy of Spark, but because $SPARK_HOME/bin comes first on your PATH, the pyspark command runs the Homebrew-installed Spark from Step 3.)
pyspark
Python 3.x ... Using Spark version ... Welcome to the PySpark shell!
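The PySpark shell predefines a SparkContext for you as sc, so you can run a quick sanity check before writing a full program (the expected result, 4950, is just the sum of 0 through 99):
sc.parallelize(range(100)).sum()  # 4950
exit()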
Step 5: Run a Simple PySpark Program
Let's write a simple PySpark program that counts how many times each word appears in a few sentences. Because the script creates its own SparkContext, run it as a standalone file with python3 (or spark-submit) rather than pasting it into the pyspark shell.
from pyspark import SparkContext

# Run Spark in local mode with the app name "WordCount"
sc = SparkContext("local", "WordCount")

data = ["Hello world", "Apache Spark is powerful", "Big Data is growing fast"]
rdd = sc.parallelize(data)  # distribute the list as an RDD

# Split each line into words, then count occurrences of each word
words = rdd.flatMap(lambda line: line.split(" "))
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# collect() brings the results back to the driver
for word, count in word_counts.collect():
    print(f"{word}: {count}")

sc.stop()
Expected output (the order of lines may vary):
Hello: 1
world: 1
Apache: 1
Spark: 1
is: 2
powerful: 1
Big: 1
Data: 1
growing: 1
fast: 1
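The map/reduceByKey pair above is the classic word-count pattern. As an aside (an equivalent alternative, not part of the program above), PySpark's countByValue() action collapses the counting into a single step:
word_counts = rdd.flatMap(lambda line: line.split(" ")).countByValue()
print(dict(word_counts))  # {'Hello': 1, 'world': 1, ...}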
Question:
Why do we use parallelize and not just a normal list?
Answer:
sc.parallelize() distributes data across Spark's workers for parallel processing. A normal list is local to the driver and can't be distributed efficiently.
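To see that distribution concretely, you can ask parallelize for an explicit number of partitions and inspect how the elements are split. This is a small illustrative snippet for the PySpark shell:
rdd = sc.parallelize(range(8), 4)  # request 4 partitions
print(rdd.getNumPartitions())      # 4
print(rdd.glom().collect())        # [[0, 1], [2, 3], [4, 5], [6, 7]]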
Summary
- Installed Java, Spark, and Python on macOS using Homebrew
- Verified Spark using both spark-shell and pyspark
- Ran a simple word count job in PySpark to validate that everything works
You are now ready to start building Spark applications locally on your Mac. The next step is learning Spark concepts like RDDs and DataFrames.
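As a small preview of the DataFrame API (a sketch only; the example rows, column names, and app name here are made up for illustration):
from pyspark.sql import SparkSession

# local[*] runs Spark locally using all available CPU cores
spark = SparkSession.builder.master("local[*]").appName("Preview").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()
spark.stop()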