Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Installing Apache Spark on macOS

This guide will help you install Apache Spark on your Mac, set it up with Python (PySpark), and verify that everything works. No prior experience is needed. Just follow along step by step.

Why Install Spark Locally?

Installing Spark on your machine allows you to learn, experiment, and run distributed code even on a single system (using local mode). It's great for beginners because you can write and test Spark jobs without needing a cluster.
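For example, once the steps below are complete, the snippet that follows starts Spark in local mode: the master URL local[*] keeps everything inside one process and uses one worker thread per CPU core (a minimal sketch; the app name is arbitrary).


from pyspark.sql import SparkSession

# local[*] = run Spark inside this process, one worker thread per CPU core
spark = (SparkSession.builder
         .master("local[*]")
         .appName("LocalModeDemo")  # arbitrary name; shows up in the Spark UI
         .getOrCreate())

# a trivial distributed computation: sum of 0..99, split across local threads
print(spark.sparkContext.parallelize(range(100)).sum())  # 4950

spark.stop()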

Step 1: Install Homebrew (if not already installed)

Homebrew is a package manager for macOS. It simplifies the installation of software like Java, Python, and Spark.


/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
    

Question:

What is Homebrew and why do we use it?

Answer:

Homebrew is like an app store for your terminal. It makes installing developer tools on Mac much easier and safer.

Step 2: Install Java (OpenJDK)

Apache Spark runs on the Java Virtual Machine (JVM). Let’s install Java 11 using Homebrew:


brew install openjdk@11
    

After installing, link the JDK so macOS can find it, and add it to your PATH (the paths below assume Apple silicon; on an Intel Mac, replace /opt/homebrew with /usr/local):


sudo ln -sfn /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk-11.jdk
echo 'export PATH="/opt/homebrew/opt/openjdk@11/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
    

Verify the Java version:


java -version

Expected output:

openjdk version "11.0.x" ...

Step 3: Install Apache Spark

Now we install Apache Spark using Homebrew:


brew install apache-spark
    

Add Spark to your environment variables:


echo 'export SPARK_HOME="/opt/homebrew/opt/apache-spark/libexec"' >> ~/.zprofile
echo 'export PATH="$SPARK_HOME/bin:$PATH"' >> ~/.zprofile
source ~/.zprofile
    

Verify Spark installation:


spark-shell

Expected output (abridged):

Using Scala version ... Welcome to Spark!

Type :quit (or press Ctrl+D) to exit the shell.

Question:

What is spark-shell and why does it use Scala?

Answer:

spark-shell is the interactive shell for Spark using Scala. Spark was originally written in Scala, so this shell is part of its core distribution.

Step 4: Set Up PySpark

PySpark allows you to use Spark with Python. First, make sure you have Python 3 installed (use Homebrew if needed). Recent Homebrew versions of Python are marked "externally managed", so install the pyspark package inside a virtual environment (the ~/spark-venv path below is just an example):


brew install python
python3 -m venv ~/spark-venv
source ~/spark-venv/bin/activate
pip install pyspark

Now, launch PySpark:


pyspark

Expected output (abridged):

Python 3.x ... Using Spark version ...
Welcome to the PySpark shell!

Step 5: Run a Simple PySpark Program

Let’s write a simple PySpark program that counts how often each word appears in a few lines of text.


from pyspark import SparkContext

# "local" runs Spark in a single local thread; "WordCount" is the application name
sc = SparkContext("local", "WordCount")

data = ["Hello world", "Apache Spark is powerful", "Big Data is growing fast"]
rdd = sc.parallelize(data)  # turn the local list into a distributed RDD

words = rdd.flatMap(lambda line: line.split(" "))  # one element per word
word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

for word, count in word_counts.collect():
    print(f"{word}: {count}")

sc.stop()
    
Expected output (the order of the lines may vary from run to run):

Hello: 1
world: 1
Apache: 1
Spark: 1
is: 2
powerful: 1
Big: 1
Data: 1
growing: 1
fast: 1
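The flatMap → map → reduceByKey chain above is the classic MapReduce pattern. As a quick sanity check, the built-in countByValue() action computes the same counts in one step; this sketch reuses the words RDD from the program (run it before sc.stop()):


# countByValue() ships the counts back to the driver as a plain dict
print(dict(words.countByValue()))
# {'Hello': 1, 'world': 1, 'Apache': 1, 'Spark': 1, 'is': 2, ...}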

Question:

Why do we use parallelize and not just a normal list?

Answer:

sc.parallelize() turns the list into an RDD whose partitions Spark can process in parallel across its workers. A plain Python list lives only in the driver process and can’t be distributed.
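You can observe the distribution directly: parallelize() splits the data into partitions, and Spark runs one task per partition. A small sketch, assuming the same sc as above (numSlices just makes the partition count explicit):


rdd = sc.parallelize(range(10), numSlices=4)  # ask for 4 partitions
print(rdd.getNumPartitions())                 # 4
print(rdd.glom().collect())                   # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]


glom() groups the elements of each partition into a list, which makes the split visible.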

Summary

You are now ready to start building Spark applications locally on your Mac. The next step is learning Spark concepts like RDDs and DataFrames.
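As a preview, here is the same word count written with the DataFrame API, which most modern Spark code uses (a hedged sketch; split and explode come from pyspark.sql.functions):


from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.master("local[*]").appName("WordCountDF").getOrCreate()

df = spark.createDataFrame([("Hello world",), ("Apache Spark is powerful",)], ["line"])

# split each line into words, explode to one row per word, then group and count
df.select(explode(split(df.line, " ")).alias("word")).groupBy("word").count().show()

spark.stop()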


