Jupyter Notebook is one of the best environments to write and experiment with PySpark code. It provides an interactive interface that is ideal for beginners learning data processing with Apache Spark.
PySpark is the Python API for Apache Spark. It allows you to harness the power of distributed computing using Python. You can use it to load data, transform it, and perform analytics across large datasets.
Make sure you have the following installed on your system: Java, Apache Spark, and Python 3 with pip. The steps below walk through each of these.
Apache Spark requires Java to run. If not already installed, you can download it from the official site or use the following command (Linux/Mac):
sudo apt install openjdk-11-jdk
Why is Java needed if we're coding in Python?
Spark is written in Scala (which runs on the Java Virtual Machine). Even when we use PySpark, under the hood, it still relies on the JVM to execute tasks.
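Since the JVM is required, you can quickly confirm that Java is reachable before going further. The sketch below calls java from Python; note that java -version writes its output to stderr rather than stdout:
import subprocess

# Runs `java -version` and prints its output; a FileNotFoundError here
# means Java is not on your PATH.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr)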
Download Spark from the official Apache Spark website: https://spark.apache.org/downloads
Extract the files and set environment variables in your shell config file (like .bashrc or .zshrc):
export SPARK_HOME=~/spark-3.5.0-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
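After reloading your shell configuration (for example with source ~/.bashrc), you can confirm that the variables are visible to any Python process started from that shell. This is only a sanity check; the values printed should match whatever paths you exported:
import os

# Should print the values you exported above; None means the shell config
# was not reloaded or the export is missing.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_PYTHON"))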
Use pip to install the necessary libraries:
pip install pyspark
pip install notebook
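To confirm that the package installed correctly, you can check its version from Python. This is a minimal check, and the exact version string depends on what pip installed:
import pyspark

# Prints the installed PySpark version, e.g. 3.5.0
print(pyspark.__version__)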
Now, set two environment variables so that the pyspark command launches Jupyter Notebook as its driver. You can start the notebook with the following command:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
This command does two things: it tells Spark to use Jupyter as the Python interpreter for the driver process (PYSPARK_DRIVER_PYTHON=jupyter), and it passes the notebook option so that running pyspark opens a Jupyter Notebook server instead of the plain interactive shell.
What happens when I run pyspark?
It initializes a Spark session and opens a shell (or notebook) connected to a Spark driver — ready to run your distributed data operations.
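Within a given notebook kernel, SparkSession.builder.getOrCreate() returns the already-active session if one exists and creates a new one otherwise, so it is safe to run more than once. A minimal sketch (the app name here is only an example):
from pyspark.sql import SparkSession

# getOrCreate() reuses the active session in this process if there is one,
# otherwise it creates a new one, so re-running this cell is harmless.
spark = SparkSession.builder.appName("Session check").getOrCreate()
print(spark.version)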
Once the notebook is open, create a new Python 3 notebook and test your PySpark setup:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark Setup Test") \
    .getOrCreate()
# Create a simple DataFrame
data = [("Alice", 28), ("Bob", 35), ("Cathy", 23)]
df = spark.createDataFrame(data, ["Name", "Age"])
# Show the DataFrame
df.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 28|
|  Bob| 35|
|Cathy| 23|
+-----+---+
If you see the table output above, your setup is successful and Spark is working correctly in Jupyter Notebook.
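To look a bit closer at the DataFrame you just created, and to clean up when you are finished, you can inspect the inferred schema and then stop the session. This is just a small sketch; the schema shown in the comments is what Spark typically prints for this data, with the integer ages inferred as long:
# Print the column names, types, and nullability Spark inferred.
df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)

# Stop the session when you are done to free driver and executor resources.
spark.stop()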
A couple of tips: use df.printSchema() to understand the structure of your data, and call spark.stop() when you're done to free resources.
Setting up PySpark in Jupyter allows you to leverage the power of Apache Spark within an interactive and beginner-friendly environment. With this setup, you're ready to write Spark code in Python and explore large datasets seamlessly.