Jupyter Notebook is one of the best environments to write and experiment with PySpark code. It provides an interactive interface that is ideal for beginners learning data processing with Apache Spark.
PySpark is the Python API for Apache Spark. It allows you to harness the power of distributed computing using Python. You can use it to load data, transform it, and perform analytics across large datasets.
Make sure you have the following installed on your system: Java, Apache Spark, and Python 3 with pip. The steps below walk through each of these.
Apache Spark requires Java to run. If not already installed, you can download it from the official site or use the following command (Linux/Mac):
sudo apt install openjdk-11-jdk
Why is Java needed if we're coding in Python?
Spark is written in Scala (which runs on the Java Virtual Machine). Even when we use PySpark, under the hood, it still relies on the JVM to execute tasks.
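Since the JVM is required, you can quickly confirm that Java is reachable before going further. The sketch below calls java from Python; note that java -version writes its output to stderr rather than stdout:
import subprocess

# Runs `java -version` and prints its output; a FileNotFoundError here
# means Java is not on your PATH.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr)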
Download Spark from the official Apache Spark website: https://spark.apache.org/downloads
Extract the files and set environment variables in your shell config file (like .bashrc or .zshrc):
export SPARK_HOME=~/spark-3.5.0-bin-hadoop3
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
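After reloading your shell configuration (for example with source ~/.bashrc), you can confirm that the variables are visible to any Python process started from that shell. This is only a sanity check; the values printed should match whatever paths you exported:
import os

# Should print the values you exported above; None means the shell config
# was not reloaded or the export is missing.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_PYTHON"))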
Use pip to install the necessary libraries:
pip install pyspark
pip install notebook
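To confirm that the package installed correctly, you can check its version from Python. This is a minimal check, and the exact version string depends on what pip installed:
import pyspark

# Prints the installed PySpark version, e.g. 3.5.0
print(pyspark.__version__)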
Now, set two environment variables so that the pyspark command launches Jupyter Notebook as its driver. You can start the notebook with the following command:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
This command does two things: it tells Spark to use Jupyter as the Python interpreter for the driver process (PYSPARK_DRIVER_PYTHON=jupyter), and it passes the notebook option so that running pyspark opens a Jupyter Notebook server instead of the plain interactive shell.
What happens when I run pyspark?
It initializes a Spark session and opens a shell (or notebook) connected to a Spark driver — ready to run your distributed data operations.
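Within a given notebook kernel, SparkSession.builder.getOrCreate() returns the already-active session if one exists and creates a new one otherwise, so it is safe to run more than once. A minimal sketch (the app name here is only an example):
from pyspark.sql import SparkSession

# getOrCreate() reuses the active session in this process if there is one,
# otherwise it creates a new one, so re-running this cell is harmless.
spark = SparkSession.builder.appName("Session check").getOrCreate()
print(spark.version)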
Once the notebook is open, create a new Python 3 notebook and test your PySpark setup:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName("PySpark Setup Test") \
    .getOrCreate()
# Create a simple DataFrame
data = [("Alice", 28), ("Bob", 35), ("Cathy", 23)]
df = spark.createDataFrame(data, ["Name", "Age"])
# Show the DataFrame
df.show()
+-----+---+
| Name|Age|
+-----+---+
|Alice| 28|
|  Bob| 35|
|Cathy| 23|
+-----+---+
If you see the table output above, your setup is successful and Spark is working correctly in Jupyter Notebook.
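To look a bit closer at the DataFrame you just created, and to clean up when you are finished, you can inspect the inferred schema and then stop the session. This is just a small sketch; the schema shown in the comments is what Spark typically prints for this data, with the integer ages inferred as long:
# Print the column names, types, and nullability Spark inferred.
df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: long (nullable = true)

# Stop the session when you are done to free driver and executor resources.
spark.stop()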
A couple of tips: use df.printSchema() to understand the structure of your data, and call spark.stop() when you're done to free resources.
Setting up PySpark in Jupyter allows you to leverage the power of Apache Spark within an interactive and beginner-friendly environment. With this setup, you're ready to write Spark code in Python and explore large datasets seamlessly.