Running Your First Spark Application
Now that you have an understanding of what Apache Spark is, let’s move one step closer by running your first Spark application. In this lesson, we'll use PySpark — the Python API for Spark — to create a simple Spark job and understand its basic flow.
How Does a Spark Application Work?
A typical Spark application consists of a driver program that runs your main function and uses a SparkContext to connect to a cluster manager and run tasks in parallel across executors.
Question:
Do I need a cluster to run a Spark application?
Answer:
No. You can run Spark on your local machine in “local mode” to develop and test your code. Later, the same code can scale out to a real cluster with little or no change.
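To make the local-versus-cluster point concrete, here is a minimal sketch. The master URL is the main thing that changes between the two setups; the app name is illustrative, and the cluster address in the comment is a placeholder, not a real endpoint.
from pyspark.sql import SparkSession

# Local mode: the driver and executors run inside a single process on your
# machine, using one worker thread per CPU core ("local[*]").
spark = SparkSession.builder \
    .appName("LocalModeDemo") \
    .master("local[*]") \
    .getOrCreate()

print(spark.sparkContext.master)  # e.g. local[*]
spark.stop()

# On a real cluster you would typically leave .master(...) out of the code
# and pass it when submitting the job instead, for example:
#   spark-submit --master spark://<cluster-host>:7077 my_spark_app.py
# (host and port above are placeholders for your own cluster manager)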
Setting Up PySpark
Before we begin, ensure you have pyspark installed. You can install it using pip:
pip install pyspark
Once installed, open your terminal or Jupyter Notebook to begin coding.
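As a quick optional sanity check, you can confirm the installation from Python; the exact version number will depend on what pip installed:
import pyspark
print(pyspark.__version__)  # e.g. 3.5.1 (your version may differ)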
Writing Your First Spark Application
Let’s write a simple program that:
- Creates a Spark session
- Creates a list of numbers
- Converts the list into a Resilient Distributed Dataset (RDD)
- Performs a simple transformation
- Collects and prints the result
PySpark Code
from pyspark.sql import SparkSession
# Step 1: Create Spark Session
spark = SparkSession.builder \
    .appName("FirstSparkApp") \
    .master("local[*]") \
    .getOrCreate()
# Step 2: Get Spark Context from Session
sc = spark.sparkContext
# Step 3: Create a list and parallelize into RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Step 4: Apply transformation (square each element)
squared_rdd = rdd.map(lambda x: x * x)
# Step 5: Collect the result
result = squared_rdd.collect()
# Step 6: Print the output
print("Squared Numbers:", result)
# Stop Spark Session
spark.stop()
Output
Squared Numbers: [1, 4, 9, 16, 25]
Step-by-Step Code Explanation
- SparkSession: Entry point for using DataFrame and SQL APIs. It also gives access to SparkContext, which is used to work with RDDs.
- parallelize(): Converts a normal Python list into a distributed dataset (RDD).
- map(): A transformation that applies a function to each element of the RDD.
- collect(): An action that retrieves the entire result from the distributed dataset back to the driver program.
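If you want to experiment further, the sketch below reuses the same pattern with a few other standard RDD operations (filter, reduce, and count). The app name and variable names are just illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RDDBasicsDemo") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations: they only describe new RDDs, nothing runs yet
evens = numbers.filter(lambda x: x % 2 == 0)   # keep the even numbers
doubled = evens.map(lambda x: x * 2)           # double each remaining element

# Actions: they trigger execution and return results to the driver
print(doubled.collect())                    # [4, 8]
print(numbers.count())                      # 5
print(numbers.reduce(lambda a, b: a + b))   # 15

spark.stop()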
Question:
Why do we need collect() at the end?
Answer:
Until you call an action like collect() or count(), Spark doesn’t actually execute anything. This is because Spark uses lazy evaluation: it builds a plan of transformations first, and only when an action is called does it execute the operations.
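You can see lazy evaluation in action with the small, self-contained sketch below: printing the RDD right after map() shows only a description of the plan, and the squared numbers appear only once collect() runs. The app name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("LazyEvalDemo") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext

# map() only records the transformation in Spark's plan; no data is processed yet
squared = sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * x)
print(squared)            # prints an RDD description, not the squared values

# collect() is an action, so Spark now actually runs the job
print(squared.collect())  # [1, 4, 9, 16, 25]

spark.stop()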
Running This Code
You can run this script:
- From a terminal using:
python my_spark_app.py
- Inside a Jupyter notebook, after installing pyspark
- Or via an IDE like PyCharm or VS Code
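You can also launch the same script with the spark-submit tool that ships with Spark (it is typically available on your PATH after installing pyspark with pip). The explicit master URL in the second command is just an example:
spark-submit my_spark_app.py
spark-submit --master "local[4]" my_spark_app.py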
Summary
Congratulations! You've just run your first Apache Spark application. You created an RDD, performed a transformation, and executed an action. As we move forward, we’ll explore working with structured data using DataFrames, SQL, and machine learning pipelines.