Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Running Your First Spark Application

Now that you have an understanding of what Apache Spark is, let’s move one step closer by running your first Spark application. In this lesson, we'll use PySpark — the Python API for Spark — to create a simple Spark job and understand its basic flow.

How Does a Spark Application Work?

A typical Spark application consists of a driver program that runs your main function and uses a SparkContext to connect to a cluster manager, which schedules tasks to run in parallel across executors.
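To make this concrete, here is a minimal sketch (assuming a local PySpark installation) that starts a driver in local mode and inspects the SparkContext it uses to talk to the cluster manager:

from pyspark.sql import SparkSession

# The driver program starts here: building a SparkSession also creates
# the underlying SparkContext that talks to the cluster manager.
spark = (
    SparkSession.builder
    .appName("DriverDemo")
    .master("local[*]")   # "local[*]" stands in for a cluster manager on one machine
    .getOrCreate()
)

sc = spark.sparkContext
print(sc.master)              # which cluster manager we connected to, e.g. local[*]
print(sc.appName)             # the application name registered with it
print(sc.defaultParallelism)  # how many tasks Spark runs in parallel by default

spark.stop()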

Question:

Do I need a cluster to run a Spark application?

Answer:

No, you can run Spark on your local machine in “local mode” to develop and test your code. Later, the same code can scale out to a real cluster with little or no change.
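For example, the only thing that ties the sketch below to your laptop is the master setting (the cluster URLs in the comments are placeholders, not real clusters):

from pyspark.sql import SparkSession

# Local mode: everything runs inside a single process on this machine.
#   "local"    -> one worker thread
#   "local[4]" -> four worker threads
#   "local[*]" -> one thread per CPU core
spark = (
    SparkSession.builder
    .appName("LocalModeDemo")
    .master("local[*]")
    .getOrCreate()
)

# On a real cluster you keep the code identical and only change the master,
# e.g. "yarn" or "spark://<host>:7077" (placeholder host), usually by passing
# --master to spark-submit instead of hard-coding it here.
print(spark.sparkContext.master)
spark.stop()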

Setting Up PySpark

Before we begin, ensure you have pyspark installed. You can install it using pip:


pip install pyspark
    

Once installed, open your terminal or Jupyter Notebook to begin coding.
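If you want to confirm the installation before writing any Spark code, a quick sanity check is to import the package and print its version:

# Verify that PySpark is importable and see which version was installed
import pyspark
print(pyspark.__version__)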

Writing Your First Spark Application

Let’s write a simple program that:

  1. Creates a Spark session
  2. Creates a list of numbers
  3. Converts the list into a Resilient Distributed Dataset (RDD)
  4. Performs a simple transformation
  5. Collects and prints the result

PySpark Code


from pyspark.sql import SparkSession

# Step 1: Create Spark Session
spark = (
    SparkSession.builder
    .appName("FirstSparkApp")
    .master("local[*]")   # run locally, using all available CPU cores
    .getOrCreate()
)

# Step 2: Get Spark Context from Session
sc = spark.sparkContext

# Step 3: Create a list and parallelize into RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Step 4: Apply transformation (square each element)
squared_rdd = rdd.map(lambda x: x * x)

# Step 5: Collect the result
result = squared_rdd.collect()

# Step 6: Print the output
print("Squared Numbers:", result)

# Stop Spark Session
spark.stop()
    

Output

Squared Numbers: [1, 4, 9, 16, 25]
    

Step-by-Step Code Explanation

Question:

Why do we need collect() at the end?

Answer:

Until you call an action like collect() or count(), Spark doesn’t actually execute anything. This is because Spark uses lazy evaluation — it builds a plan of transformations first, and only when an action is called does it execute the operations.
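Here is a small sketch of that behaviour, separate from the lesson’s main script: the map() call returns immediately without touching the data, and the work only happens when an action such as count() or collect() is called.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: Spark only records that a map over this RDD is needed.
# No computation happens on this line.
squared = rdd.map(lambda x: x * x)

# Actions: only now does Spark build the job, send tasks to executors,
# and run the lambda over the data.
print(squared.count())    # 5
print(squared.collect())  # [1, 4, 9, 16, 25]

spark.stop()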

Running This Code

You can run this script:

  1. From a terminal using: python my_spark_app.py
  2. Inside Jupyter Notebook after installing pyspark
  3. Or via an IDE like PyCharm or VS Code
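For example, assuming you saved the script from this lesson as my_spark_app.py, either of the following works from a terminal (spark-submit is the launcher script that ships with the pyspark package):

python my_spark_app.py

spark-submit my_spark_app.py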

Summary

Congratulations! You've just run your first Apache Spark application. You created an RDD, performed a transformation, and executed an action. As we move forward, we’ll explore working with structured data using DataFrames, SQL, and machine learning pipelines.


