Installing Apache Spark on a Local Linux System
Apache Spark is a powerful big data processing engine. This guide walks you through installing Spark step by step on a Linux-based system (such as Ubuntu). By the end, you will have a working Spark setup that you can use from the command line or through PySpark, for example in Jupyter notebooks.
Step 1: Check Java Installation
Spark requires Java (for Spark 3.x, that means Java 8, 11, or 17). Check whether it is already installed:
java -version
Question:
Why does Spark need Java?
Answer:
Spark is written in Scala, which runs on the Java Virtual Machine (JVM). So Java is necessary to run Spark behind the scenes.
If Java is not installed, run the following to install it:
sudo apt update
sudo apt install openjdk-11-jdk -y
Step 2: Install Scala
Scala is Spark’s native language. Even though we will work mainly in PySpark (Python), Spark relies on Scala internally.
sudo apt install scala -y
Verify installation:
scala -version
Step 3: Download and Extract Apache Spark
Go to the official Spark downloads page and copy the link for the latest stable release. Then run the following, adjusting the version number in the commands if you copied a newer release:
cd ~
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvzf spark-3.5.0-bin-hadoop3.tgz
mv spark-3.5.0-bin-hadoop3 spark
Step 4: Set Environment Variables
Add the following lines to your ~/.bashrc file:
# Spark
export SPARK_HOME=~/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3
Apply the changes:
source ~/.bashrc
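Optionally, you can confirm from Python that the new variables are visible. The snippet below is only a quick sanity check using the standard library; it simply reads back the variables exported above:
import os
# Sanity check: print the variables exported in ~/.bashrc
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))
If either value prints as None, re-run source ~/.bashrc or open a new terminal.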
Step 5: Verify Spark Installation
Now verify that Spark was installed successfully:
spark-shell
This opens the Spark shell using Scala.
To run with Python (PySpark):
pyspark
Question:
Why use PySpark if Spark is written in Scala?
Answer:
PySpark provides a Python API for Spark, making it easier for Python users to write distributed data processing code without learning Scala or Java.
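Once the pyspark shell is open, it already provides a SparkSession in a variable named spark, so you can run a couple of quick commands to confirm everything works. For example (just a sanity check):
spark.version
spark.range(5).show()
The first prints the Spark version, and the second builds a tiny DataFrame with an id column from 0 to 4 and displays it. Type exit() to leave the shell. If you prefer working in Jupyter, one common approach is to point the pyspark launcher at a notebook server via the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS environment variables.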
Step 6 (Optional): Use Spark in a Python Script
You can now use Spark directly in your Python files:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("SampleApp").getOrCreate()
# Create sample data
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
+-----+---+
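From here you can keep transforming the DataFrame with the regular PySpark API and stop the session when you are done. For example (the filter condition below is arbitrary, purely for illustration):
# Keep only rows where Age is greater than 26
df.filter(df.Age > 26).show()
# Stop the SparkSession to release resources
spark.stop()
If you save the code as a standalone file, you can also run it with the spark-submit command included in $SPARK_HOME/bin.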
Summary
You’ve successfully installed Apache Spark on your local Linux machine. Along the way you set up its dependencies, Java and Scala, and verified the installation through both the Scala and Python shells. Now you're ready to start building powerful data pipelines using PySpark.