Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Setting up PySpark in Jupyter Notebook

Jupyter Notebook is one of the best environments to write and experiment with PySpark code. It provides an interactive interface that is ideal for beginners learning data processing with Apache Spark.

What is PySpark?

PySpark is the Python API for Apache Spark. It allows you to harness the power of distributed computing using Python. You can use it to load data, transform it, and perform analytics across large datasets.
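
As a quick preview of the kind of code this enables, the sketch below loads a CSV file and filters it. The file name people.csv is a made-up placeholder, and the setup this code relies on is covered in the steps that follow.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("Preview").getOrCreate()

# Load a CSV file into a DataFrame (people.csv is a hypothetical example file).
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# Transform and analyze: keep rows where age is at least 18, then count them.
adults = df.filter(df["age"] >= 18)
print(adults.count())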

Why Use Jupyter with PySpark?

Jupyter lets you run Spark code cell by cell, see results immediately, and keep code, notes, and output together in one place, which makes it well suited for experimenting with transformations and learning Spark interactively.

Prerequisites

Make sure you have the following installed on your system:

- Python 3 and pip
- Java (Step 1 below installs OpenJDK 11)
- Apache Spark (Step 2)
- Jupyter Notebook (installed via pip in Step 3)

Step 1: Install Java

Apache Spark requires Java to run. If it is not already installed, you can download it from the official site or use the following command (Debian/Ubuntu Linux; on macOS, install an OpenJDK build through Homebrew instead):


sudo apt install openjdk-11-jdk
    

Question:

Why is Java needed if we're coding in Python?

Answer:

Spark is written in Scala (which runs on the Java Virtual Machine). Even when we use PySpark, under the hood, it still relies on the JVM to execute tasks.
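
If you want to confirm from Python that a JVM is actually available before going further, a minimal check (assuming Java was installed as in Step 1) is:

import subprocess

# Prints the installed Java version; raises an exception if the java command is missing or fails.
subprocess.run(["java", "-version"], check=True)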

Step 2: Install Apache Spark

Download Spark from the official Apache Spark website: https://spark.apache.org/downloads

Extract the files and set environment variables in your shell config file (like .bashrc or .zshrc):


export SPARK_HOME=~/spark-3.5.0-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
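
If you prefer not to edit your shell config, you can set the same variables from inside Python before importing pyspark. The path below mirrors the example above and is an assumption; adjust it to wherever you extracted Spark.

import os

# Equivalent of the shell exports above, applied to the current Python process only.
os.environ["SPARK_HOME"] = os.path.expanduser("~/spark-3.5.0-bin-hadoop3")
os.environ["PYSPARK_PYTHON"] = "python3"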
    

Step 3: Install Required Python Packages

Use pip to install the necessary libraries:


pip install pyspark
pip install notebook
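
To confirm the packages installed correctly, a quick check from any Python prompt is:

import pyspark

# Should print the installed Spark version, for example 3.5.0.
print(pyspark.__version__)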
    

Step 4: Configure PySpark for Jupyter

Now configure environment variables so that PySpark launches Jupyter as its driver frontend. You can start Jupyter through PySpark with the following command:


PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
    

This command does two things:

- PYSPARK_DRIVER_PYTHON=jupyter tells Spark to use Jupyter as the driver's Python program instead of the default PySpark shell.
- PYSPARK_DRIVER_PYTHON_OPTS="notebook" passes the notebook option to Jupyter, so running pyspark starts a Jupyter Notebook server with Spark already wired in.

Question:

What happens when I run pyspark?

Answer:

It initializes a Spark session and opens a shell (or notebook) connected to a Spark driver — ready to run your distributed data operations.
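
If you would rather start Jupyter the usual way (with jupyter notebook) and create the session yourself, one common alternative is the findspark helper package. This is not part of the steps above and assumes you have run pip install findspark.

import findspark

# Locate the Spark installation pointed to by SPARK_HOME and add it to sys.path.
findspark.init()

from pyspark.sql import SparkSession

# A SparkSession can now be created in a normally launched notebook.
spark = SparkSession.builder.appName("Notebook Session").getOrCreate()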

Step 5: Test PySpark in Jupyter Notebook

Once the notebook is open, create a new Python 3 notebook and test your PySpark setup:


from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark Setup Test") \
    .getOrCreate()

# Create a simple DataFrame
data = [("Alice", 28), ("Bob", 35), ("Cathy", 23)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Show the DataFrame
df.show()
    
Expected output:

+-----+---+
| Name|Age|
+-----+---+
|Alice| 28|
|  Bob| 35|
|Cathy| 23|
+-----+---+
    

If you see the table output above, your setup is successful and Spark is working correctly in Jupyter Notebook.
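
From here you can try a first transformation on the same DataFrame and stop the session when you are finished. A small sketch:

from pyspark.sql import functions as F

# Filter people older than 25 and add a constant label column.
older = df.filter(df["Age"] > 25).withColumn("Group", F.lit("over-25"))
older.show()

# Stop the session when you are done to release the driver's resources.
spark.stop()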

Tips for Beginners

- Call spark.stop() when you are finished so the driver releases its resources.
- df.show() prints only the first 20 rows by default; pass a number, such as df.show(50), to see more.
- While a session is running, the Spark web UI at http://localhost:4040 shows your jobs, stages, and storage.

Summary

Setting up PySpark in Jupyter allows you to leverage the power of Apache Spark within an interactive and beginner-friendly environment. With this setup, you're ready to write Spark code in Python and explore large datasets seamlessly.


