Installing Apache Spark on a Local Linux System
Apache Spark is a powerful big data processing engine. This guide walks you through installing Spark step by step on a Linux-based system (such as Ubuntu). By the end, you will have a working Spark setup that you can use from the command line or through PySpark, for example in Jupyter notebooks.
Step 1: Check Java Installation
Spark requires Java (for Spark 3.x, that means Java 8, 11, or 17). Check whether it is already installed:
java -version
Question:
Why does Spark need Java?
Answer:
Spark is written in Scala, which runs on the Java Virtual Machine (JVM). So Java is necessary to run Spark behind the scenes.
If Java is not installed, run the following to install it:
sudo apt update
sudo apt install openjdk-11-jdk -y
Step 2: Install Scala
Scala is Spark’s native language. Even though we will work mainly in PySpark (Python), Spark relies on Scala internally.
sudo apt install scala -y
Verify installation:
scala -version
Step 3: Download and Extract Apache Spark
Go to the official Spark downloads page and copy the link for the latest stable release. Then run the following, adjusting the version number in the commands if you copied a newer release:
cd ~
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 spark
Step 4: Set Environment Variables
Add the following lines to your ~/.bashrc file:
# Spark
export SPARK_HOME=~/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3
Apply the changes:
source ~/.bashrc
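Optionally, you can confirm from Python that the new variables are visible. The snippet below is only a quick sanity check using the standard library; it simply reads back the variables exported above:
import os
# Sanity check: print the variables exported in ~/.bashrc
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))
If either value prints as None, re-run source ~/.bashrc or open a new terminal.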
Step 5: Verify Spark Installation
Now verify that Spark was installed successfully:
spark-shell
This opens the Spark shell using Scala.
To run with Python (PySpark):
pyspark
Question:
Why use PySpark if Spark is written in Scala?
Answer:
PySpark provides a Python API for Spark, making it easier for Python users to write distributed data processing code without learning Scala or Java.
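Once the pyspark shell is open, it already provides a SparkSession in a variable named spark, so you can run a couple of quick commands to confirm everything works. For example (just a sanity check):
spark.version
spark.range(5).show()
The first prints the Spark version, and the second builds a tiny DataFrame with an id column from 0 to 4 and displays it. Type exit() to leave the shell. If you prefer working in Jupyter, one common approach is to point the pyspark launcher at a notebook server via the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables.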
Step 6 (Optional): Use Spark in a Python Script
You can now use Spark directly in your Python files:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("SampleApp").getOrCreate()
# Create sample data
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+
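From here you can keep transforming the DataFrame with the regular PySpark API and stop the session when you are done. For example (the filter condition below is arbitrary, purely for illustration):
# Keep only rows where Age is greater than 26
df.filter(df.Age > 26).show()
# Stop the SparkSession to release resources
spark.stop()
If you save the code as a standalone file, you can also run it with the spark-submit command included in $SPARK_HOME/bin.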
Summary
You’ve successfully installed Apache Spark on your local Linux machine. Along the way you set up its dependencies, Java and Scala, and verified the installation through both the Scala and Python shells. Now you're ready to start building powerful data pipelines using PySpark.