Running Your First Spark Application
Now that you have an understanding of what Apache Spark is, let’s move one step closer by running your first Spark application. In this lesson, we'll use PySpark — the Python API for Spark — to create a simple Spark job and understand its basic flow.
How Does a Spark Application Work?
A typical Spark application consists of a driver program that runs your main function and uses a SparkContext to connect to a cluster manager and run tasks in parallel across executors.
Question:
Do I need a cluster to run a Spark application?
Answer:
No. You can run Spark on your local machine in “local mode” to develop and test your code. Later, the same code can scale out to a real cluster with little or no change.
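To make the local-versus-cluster point concrete, here is a minimal sketch. The master URL is the main thing that changes between the two setups; the app name is illustrative, and the cluster address in the comment is a placeholder, not a real endpoint.
from pyspark.sql import SparkSession

# Local mode: the driver and executors run inside a single process on your
# machine, using one worker thread per CPU core ("local[*]").
spark = SparkSession.builder \
    .appName("LocalModeDemo") \
    .master("local[*]") \
    .getOrCreate()

print(spark.sparkContext.master)  # e.g. local[*]
spark.stop()

# On a real cluster you would typically leave .master(...) out of the code
# and pass it when submitting the job instead, for example:
#   spark-submit --master spark://<cluster-host>:7077 my_spark_app.py
# (host and port above are placeholders for your own cluster manager)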
Setting Up PySpark
Before we begin, ensure you have pyspark installed. You can install it using pip:
pip install pyspark
Once installed, open your terminal or Jupyter Notebook to begin coding.
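As a quick optional sanity check, you can confirm the installation from Python; the exact version number will depend on what pip installed:
import pyspark
print(pyspark.__version__)  # e.g. 3.5.1 (your version may differ)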
Writing Your First Spark Application
Let’s write a simple program that:
- Creates a Spark session
- Creates a list of numbers
- Converts the list into a Resilient Distributed Dataset (RDD)
- Performs a simple transformation
- Collects and prints the result
PySpark Code
from pyspark.sql import SparkSession
# Step 1: Create Spark Session
spark = SparkSession.builder \
    .appName("FirstSparkApp") \
    .master("local[*]") \
    .getOrCreate()
# Step 2: Get Spark Context from Session
sc = spark.sparkContext
# Step 3: Create a list and parallelize into RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Step 4: Apply transformation (square each element)
squared_rdd = rdd.map(lambda x: x * x)
# Step 5: Collect the result
result = squared_rdd.collect()
# Step 6: Print the output
print("Squared Numbers:", result)
# Stop Spark Session
spark.stop()
Output
Squared Numbers: [1, 4, 9, 16, 25]
Step-by-Step Code Explanation
- SparkSession: Entry point for using DataFrame and SQL APIs. It also gives access to SparkContext, which is used to work with RDDs.
- parallelize(): Converts a normal Python list into a distributed dataset (RDD).
- map(): A transformation that applies a function to each element of the RDD.
- collect(): An action that retrieves the entire result from the distributed dataset back to the driver program.
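If you want to experiment further, the sketch below reuses the same pattern with a few other standard RDD operations (filter, reduce, and count). The app name and variable names are just illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RDDBasicsDemo") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: they only describe new RDDs, nothing runs yet
evens = numbers.filter(lambda x: x % 2 == 0)   # keep the even numbers
doubled = evens.map(lambda x: x * 2)           # double each remaining element

# Actions: they trigger execution and return results to the driver
print(doubled.collect())                    # [4, 8]
print(numbers.count())                      # 5
print(numbers.reduce(lambda a, b: a + b))   # 15

spark.stop()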
Question:
Why do we need collect() at the end?
Answer:
Until you call an action like collect() or count(), Spark doesn’t actually execute anything. This is because Spark uses lazy evaluation: it builds a plan of transformations first, and only when an action is called does it execute the operations.
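You can see lazy evaluation in action with the small, self-contained sketch below: printing the RDD right after map() shows only a description of the plan, and the squared numbers appear only once collect() runs. The app name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LazyEvalDemo") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext

# map() only records the transformation in Spark's plan; no data is processed yet
squared = sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * x)
print(squared)            # prints an RDD description, not the squared values

# collect() is an action, so Spark now actually runs the job
print(squared.collect())  # [1, 4, 9, 16, 25]

spark.stop()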
Running This Code
You can run this script:
- From a terminal using:
python my_spark_app.py
- Inside a Jupyter notebook, after installing pyspark
- Or via an IDE like PyCharm or VS Code
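You can also launch the same script with the spark-submit tool that ships with Spark (it is typically available on your PATH after installing pyspark with pip). The explicit master URL in the second command is just an example:
spark-submit my_spark_app.py
spark-submit --master "local[4]" my_spark_app.py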
Summary
Congratulations! You've just run your first Apache Spark application. You created an RDD, performed a transformation, and executed an action. As we move forward, we’ll explore working with structured data using DataFrames, SQL, and machine learning pipelines.