Apache Spark is a powerful big data processing engine. This guide walks you through the step-by-step installation of Spark on a Linux-based system (such as Ubuntu). By the end, you will have a working Spark setup you can use from the command line or with PySpark in Jupyter notebooks.
Spark requires Java (Spark 3.x runs on Java 8, 11, or 17). Check if it's already installed:
java -version
Why does Spark need Java?
Spark is written in Scala, which runs on the Java Virtual Machine (JVM). So Java is necessary to run Spark behind the scenes.
If Java is not installed, run the following to install it:
sudo apt update
sudo apt install openjdk-11-jdk -y
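Some setups also expect the JAVA_HOME variable to be set. If you want to set it explicitly, add a line like the following to your ~/.bashrc; the path below is the usual OpenJDK 11 location on Ubuntu amd64, so confirm yours first with update-alternatives --list java:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64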
Spark's core is written in Scala. The Spark download bundles the Scala libraries it needs, so this step is optional if you only plan to use PySpark (Python), but installing Scala lets you experiment with the language directly:
sudo apt install scala -y
Verify installation:
scala -version
Go to the official Spark downloads page and copy the link for the latest stable release. Then run the following, adjusting the version number to match the release you copied:
cd ~
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 spark
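Optionally, verify the download by computing its SHA-512 hash and comparing it against the checksum published next to the release on the downloads page:
sha512sum spark-3.5.0-bin-hadoop3.tgz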
Add the following to your ~/.bashrc file:
# Spark
export SPARK_HOME=~/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3
Apply the changes:
source ~/.bashrc
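To confirm the variables took effect and Spark's binaries are on your PATH, you can run:
echo $SPARK_HOME
spark-submit --version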
Now verify that Spark was installed successfully:
spark-shell
This opens the Spark shell using Scala.
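Inside the shell, a SparkSession is already available as the variable spark. You can run a quick sanity check, which should return 1000, and then type :quit to exit:
spark.range(1000).count()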
To run with Python (PySpark):
pyspark
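The introduction mentioned Jupyter notebooks. If you have Jupyter installed, one common approach (a sketch, assuming a local Jupyter install) is to tell the pyspark launcher to use Jupyter as the driver Python:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
With these set, running pyspark opens a notebook server, and new notebooks typically start with the spark session already defined.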
Why use PySpark if Spark is written in Scala?
PySpark provides a Python API for Spark, making it easier for Python users to write distributed data processing code without learning Scala or Java.
You can now use Spark directly in your Python files. Save the following as, for example, sample_app.py and run it with spark-submit sample_app.py:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("SampleApp").getOrCreate()
# Create sample data
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
# Display the DataFrame as a table
df.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+
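From here you can chain further transformations. A minimal follow-on sketch, continuing from the script above:
# Transformations are lazy; nothing executes until an action such as show()
adults = df.filter(df.Age > 28)
adults.show()
# Stop the session to release resources when the script is done
spark.stop()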
You’ve successfully installed Apache Spark on your local Linux machine. You also learned about the dependencies like Java and Scala, and how to verify and run Spark through both Scala and Python shells. Now you're ready to start building powerful data pipelines using PySpark.