Installing Apache Spark on macOS
This guide will help you install Apache Spark on your Mac, set it up with Python (PySpark), and verify that everything works. No prior experience is needed. Just follow along step by step.
Why Install Spark Locally?
Installing Spark on your own machine lets you learn, experiment, and run distributed code on a single system using local mode, where the driver and executors all run inside one process. It's great for beginners because you can write and test Spark jobs without needing a cluster.
Step 1: Install Homebrew (if not already installed)
Homebrew is a package manager for macOS. It simplifies the installation of software such as Java, Python, and Spark.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
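If the install script completes without errors, you can confirm that Homebrew is available by checking its version:
brew --version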
Question:
What is Homebrew and why do we use it?
Answer:
Homebrew is like an app store for your terminal. It makes installing developer tools on Mac much easier and safer.
Step 2: Install Java (OpenJDK)
Apache Spark runs on the Java Virtual Machine (JVM). Let’s install Java 11 using Homebrew:
brew install openjdk@11
After installing, link the JDK so macOS can find it, and add it to your PATH. (The paths below assume an Apple Silicon Mac; on Intel Macs, Homebrew installs under /usr/local instead of /opt/homebrew, so adjust accordingly.)
sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
Verify the Java version:
java -version
openjdk version "11.0.x" ...
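Some of Spark's launch scripts look for the JAVA_HOME environment variable rather than the java binary on your PATH. If you later see a "JAVA_HOME is not set" warning, you can set it using macOS's built-in java_home helper:
echo 'export JAVA_HOME="$(/usr/libexec/java_home -v 11)"' >> ~/.zprofile
source ~/.zprofile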
Step 3: Install Apache Spark
Now we install Apache Spark using Homebrew:
brew install apache-spark
Add Spark to your environment variables:
echo 'export SPARK_HOME="/opt/homebrew/opt/apache-spark/libexec"' >> ~/.zprofile
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
Verify the Spark installation by launching the interactive shell:
spark-shell
Using Scala version ... Welcome to Spark!
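Type :quit (or press Ctrl+D) to leave the shell. If you'd rather confirm the installed version without opening an interactive shell, spark-submit can print it:
spark-submit --version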
Question:
What is spark-shell and why does it use Scala?
Answer:
spark-shell is the interactive shell for Spark, based on Scala. Spark was originally written in Scala, so this shell is part of its core distribution.
Step 4: Set Up PySpark
PySpark allows you to use Spark with Python. First, make sure you have Python 3 installed (use Homebrew if needed).
brew install python
pip3 install pyspark
Now launch PySpark. (The pip package bundles its own copy of Spark, but because $SPARK_HOME/bin comes first on your PATH, the pyspark command runs the Homebrew-installed Spark from Step 3.)
pyspark
Python 3.x ... Using Spark version ... Welcome to the PySpark shell!
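The PySpark shell predefines a SparkContext for you as sc, so you can run a quick sanity check before writing a full program (the expected result, 4950, is just the sum of 0 through 99):
sc.parallelize(range(100)).sum()  # 4950
exit()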
Step 5: Run a Simple PySpark Program
Let's write a simple PySpark program that counts how many times each word appears in a few sentences. Because the script creates its own SparkContext, run it as a standalone file with python3 (or spark-submit) rather than pasting it into the pyspark shell.
from pyspark import SparkContext

# Run Spark in local mode with the app name "WordCount"
sc = SparkContext("local", "WordCount")

data = ["Hello world", "Apache Spark is powerful", "Big Data is growing fast"]
rdd = sc.parallelize(data)  # distribute the list as an RDD

# Split each line into words, then count occurrences of each word
words = rdd.flatMap(lambda line: line.split(" "))
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# collect() brings the results back to the driver
for word, count in word_counts.collect():
    print(f"{word}: {count}")

sc.stop()
Expected output (the order of lines may vary):
Hello: 1
world: 1
Apache: 1
Spark: 1
is: 2
powerful: 1
Big: 1
Data: 1
growing: 1
fast: 1
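The map/reduceByKey pair above is the classic word-count pattern. As an aside (an equivalent alternative, not part of the program above), PySpark's countByValue() action collapses the counting into a single step:
word_counts = rdd.flatMap(lambda line: line.split(" ")).countByValue()
print(dict(word_counts))  # {'Hello': 1, 'world': 1, ...}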
Question:
Why do we use parallelize and not just a normal list?
Answer:
sc.parallelize() distributes data across Spark's workers for parallel processing. A normal list is local to the driver and can't be distributed efficiently.
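To see that distribution concretely, you can ask parallelize for an explicit number of partitions and inspect how the elements are split. This is a small illustrative snippet for the PySpark shell:
rdd = sc.parallelize(range(8), 4)  # request 4 partitions
print(rdd.getNumPartitions())      # 4
print(rdd.glom().collect())        # [[0, 1], [2, 3], [4, 5], [6, 7]]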
Summary
- Installed Java, Spark, and Python on macOS using Homebrew
- Verified Spark using both spark-shell and pyspark
- Ran a simple word count job in PySpark to validate that everything works
You are now ready to start building Spark applications locally on your Mac. The next step is learning Spark concepts like RDDs and DataFrames.
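As a small preview of the DataFrame API (a sketch only; the example rows, column names, and app name here are made up for illustration):
from pyspark.sql import SparkSession

# local[*] runs Spark locally using all available CPU cores
spark = SparkSession.builder.master("local[*]").appName("Preview").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()
spark.stop()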