Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Job, Stage, and Task in Apache Spark

To understand how Apache Spark executes your code behind the scenes, it’s essential to break down the three core components of its execution plan: Jobs, Stages, and Tasks. These are the building blocks of Spark's distributed processing engine.

What is a Job?

A Job is triggered when an action is called on a DataFrame or RDD, such as count(), collect(), or a write via df.write.save(). It is the top-level unit of work submitted to Spark by the driver program.

Example:


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JobExample").getOrCreate()

# The air-travel CSV can be downloaded from
# https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv
# (spark.read.csv expects a file system path, not an HTTP URL).
# inferSchema=True makes the year columns numeric so the filter below compares numbers.
df = spark.read.csv("airtravel.csv", header=True, inferSchema=True)

df_filtered = df.filter(df["1958"] > 350)
df_filtered.show()  # This is an action that triggers a Job
+-----+----+----+----+
|Month|1958|1959|1960|
+-----+----+----+----+
|  MAR| 362| 406| 419|
|  MAY| 363| 420| 472|
+-----+----+----+----+
(output abridged; only rows where the 1958 value is greater than 350 are returned)

In this case, the show() method is the action. When it's called, Spark creates a Job to compute the filtered DataFrame and display it.

Question:

When does Spark actually run the code?

Answer:

Only when an action is called. Until then, all transformations like filter() are lazily evaluated and stored as a lineage graph.
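
To see this yourself, here is a minimal sketch reusing df from the example above. The filter() call returns immediately because it only records a step in the lineage graph, explain() prints the plan without running anything, and only the count() action submits a Job.

# Lazy evaluation sketch (reuses df from the earlier example)

# No data is read here: filter() only adds a step to the lineage graph.
df_filtered = df.filter(df["1958"] > 350)

# explain() shows the query plan Spark would run, without triggering a Job.
df_filtered.explain()

# Only an action, such as count(), actually submits a Job to the cluster.
print(df_filtered.count())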

What is a Stage?

A Stage is a group of tasks that can be executed together without moving data between partitions. Spark breaks each job into stages at wide transformations like groupBy() or join(), which require data to be shuffled between partitions.

Stages are of two types:

  1. ShuffleMapStage: produces intermediate shuffle output that a downstream stage consumes.
  2. ResultStage: the final stage, which computes the result of the action.

Example:


from pyspark.sql.functions import avg

# Assume df is the same from earlier
df_grouped = df.groupBy("Month").agg(avg("1958").alias("avg_1958"))
df_grouped.show()
    
+-----+----------+
|Month|avg_1958  |
+-----+----------+
| JAN | 340.0    |
| FEB | 318.0    |
| MAR | 362.0    |
+-----+----------+
    

Here, groupBy() introduces a shuffle. So, the job is broken into at least two stages:

  1. Stage 1: Read and map records.
  2. Stage 2: Shuffle and reduce by key (month).

Question:

Why does Spark break jobs into stages?

Answer:

To isolate and manage shuffle boundaries. Narrow operations that need no shuffle can be pipelined together in one stage; a shuffle requires the data to be redistributed across partitions, so each shuffle starts a new stage.
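
You can see this boundary in the physical plan. Below is a quick sketch reusing df and df_grouped from the example above; the Exchange node in the plan corresponds to the shuffle, and the plan text shown in the comments is abridged and varies by Spark version.

from pyspark.sql.functions import avg

# Reuses df from the earlier examples.
df_grouped = df.groupBy("Month").agg(avg("1958").alias("avg_1958"))

# Print the physical plan without running a Job.
df_grouped.explain()

# Abridged, version-dependent output -- the Exchange marks the shuffle,
# i.e. the boundary between Stage 1 and Stage 2:
#   HashAggregate(keys=[Month], functions=[avg(1958)])
#   +- Exchange hashpartitioning(Month, 200)
#      +- HashAggregate(keys=[Month], functions=[partial_avg(1958)])
#         +- FileScan csv ...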

What is a Task?

A Task is the smallest unit of execution in Spark. Each stage is divided into multiple tasks — one for each data partition. Tasks are distributed to worker nodes (executors).

Example:

If a DataFrame has 4 partitions, the job is broken into 2 stages, and each stage also operates on 4 partitions, then each stage has 4 tasks, so Spark executes 8 tasks in total. (In practice, the task count of a post-shuffle stage is set by the number of shuffle partitions, which can differ from the number of input partitions.)
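
You can check these numbers on a live session. The sketch below reuses df from the earlier examples; the shuffle-partition setting shown is what determines the task count of a post-shuffle stage.

# Number of input partitions = number of tasks in the stage that reads df.
print(df.rdd.getNumPartitions())

# After a shuffle, the next stage's task count is governed by this setting
# (default 200; adaptive query execution may coalesce it at runtime).
print(spark.conf.get("spark.sql.shuffle.partitions"))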

Question:

Do more partitions always mean faster execution?

Answer:

Not always. Too few partitions underutilize the cluster; too many add scheduling and shuffle overhead because each task does very little work. A balanced partition count gives the best performance.
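
For completeness, here is a sketch of the usual knobs for adjusting partition counts. The numbers are purely illustrative, and df is reused from the earlier examples.

# repartition() performs a full shuffle and can increase or decrease the count.
df_balanced = df.repartition(8)

# coalesce() only merges existing partitions (no full shuffle),
# so it is the cheaper choice when you only need fewer partitions.
df_fewer = df_balanced.coalesce(2)

# For small datasets like this one, lowering the shuffle partition count
# avoids launching hundreds of near-empty tasks after a groupBy().
spark.conf.set("spark.sql.shuffle.partitions", "8")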

Visualizing: Job → Stages → Tasks

Let’s summarize how this works:

  1. You write code like df.groupBy().agg().show()
  2. Spark creates a Job for the action (show())
  3. Spark breaks the job into Stages at shuffle boundaries
  4. Each stage is split into Tasks, one per partition
  5. Tasks are distributed to executors for parallel execution
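
To connect these steps to something you can inspect, the short sketch below reruns the grouped example. While the application is running, the Spark UI (http://localhost:4040 by default, assuming a local session) lists this Job, its Stages, and the Tasks in each stage.

# Everything before show() is lazy; show() triggers the Job.
df_grouped = df.groupBy("Month").agg(avg("1958").alias("avg_1958"))
df_grouped.show()

# Open the Spark UI while the session is alive to see:
#   Jobs tab   -> one Job for show()
#   Stages tab -> two stages separated by the shuffle
#   Tasks      -> one task per partition within each stage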

Summary

Understanding Jobs, Stages, and Tasks helps you optimize performance and debug Spark applications effectively. These concepts are also essential when interpreting the Spark UI and job execution plan.


