Job, Stage, and Task in Apache Spark
To understand how Apache Spark executes your code behind the scenes, it’s essential to break down the three core components of its execution plan: Jobs, Stages, and Tasks. These are the building blocks of Spark's distributed processing engine.
What is a Job?
A Job is triggered when an action is called on a DataFrame or RDD, such as count(), collect(), or save(). It's the top-level unit of work submitted to Spark by the driver program.
Example:
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JobExample").getOrCreate()
# Spark can't read http(s) paths directly, so fetch the file first and read the local copy
spark.sparkContext.addFile("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
df = spark.read.csv("file://" + SparkFiles.get("airtravel.csv"), header=True)
df_filtered = df.filter(df["1958"] > 350)  # Transformation: recorded lazily, not run yet
df_filtered.show()  # This is an action that triggers a Job
+-----+-----+-----+-----+
|Month|1958 |1959 |1960 |
+-----+-----+-----+-----+
| MAR | 362 | 406 | 419 |
| APR | 348 | 396 | 461 |
| MAY | 363 | 420 | 472 |
+-----+-----+-----+-----+
In this case, the show() method is the action. When it's called, Spark creates a Job to compute the filtered DataFrame and display it.
Question:
When does Spark actually run the code?
Answer:
Only when an action is called. Until then, transformations such as filter() are lazily evaluated and stored as a lineage graph.
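Example (a minimal sketch reusing the df defined above, just to illustrate the laziness):
# Transformations only extend the lineage graph; nothing is read or computed here.
df_lazy = df.filter(df["1958"] > 350).select("Month", "1958")
# The action below forces Spark to evaluate the lineage, which triggers a Job.
print(df_lazy.count())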
What is a Stage?
A Stage is a logical group of tasks that can be executed together. Spark breaks each job into stages based on the presence of wide transformations, such as groupBy() or join(), which require data to be shuffled between partitions.
Stages are of two types:
- ShuffleMapStage: an intermediate stage whose output is written as shuffle data for a downstream stage.
- ResultStage: the final stage, which computes the result of the action and returns it to the driver.
Example:
from pyspark.sql.functions import avg
# Assume df is the same from earlier
df_grouped = df.groupBy("Month").agg(avg("1958").alias("avg_1958"))  # Wide transformation: requires a shuffle
df_grouped.show()  # Action: triggers a Job with two stages
+-----+--------+
|Month|avg_1958|
+-----+--------+
| JAN |   340.0|
| FEB |   318.0|
| MAR |   362.0|
+-----+--------+
Here, groupBy() introduces a shuffle, so the job is broken into at least two stages:
- Stage 1: Read and map records.
- Stage 2: Shuffle and reduce by key (month).
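You can see this shuffle boundary yourself by printing the physical plan; the Exchange operator in the output marks where Spark splits the job into stages (a small sketch reusing df_grouped from above):
# The plan shows partial and final aggregation steps separated by an Exchange,
# i.e. the shuffle that forms the boundary between Stage 1 and Stage 2.
df_grouped.explain()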
Question:
Why does Spark break jobs into stages?
Answer:
To isolate and manage shuffle boundaries. Operations without shuffling can run in the same stage. Shuffle operations need data to be grouped differently, so they form new stages.
What is a Task?
A Task is the smallest unit of execution in Spark. Each stage is divided into multiple tasks — one for each data partition. Tasks are distributed to worker nodes (executors).
Example:
If a DataFrame has 4 partitions and the job is broken into 2 stages that each operate on 4 partitions, then each stage will have 4 tasks, one per partition, so Spark executes 8 tasks in total.
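You can check the partition count, and therefore the task count per stage, directly. A minimal sketch reusing df_filtered from earlier; the exact numbers depend on the input file and your configuration:
# Number of partitions in this DataFrame = number of tasks in its current stage.
print(df_filtered.rdd.getNumPartitions())
# The stage after a shuffle instead uses spark.sql.shuffle.partitions (default 200),
# so its task count can differ from the pre-shuffle stage.
print(spark.conf.get("spark.sql.shuffle.partitions"))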
Question:
Do more partitions always mean faster execution?
Answer:
Not always. Too few partitions leave executors idle, while too many add scheduling and shuffle overhead. A balanced partition count keeps the whole cluster busy without creating lots of tiny, inefficient tasks.
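If the default split does not suit your data or cluster, you can adjust it explicitly. The partition counts below are illustrative, not recommendations:
# repartition() redistributes the data across more partitions (this itself causes a shuffle).
df_more = df_filtered.repartition(8)
# coalesce() merges partitions without a full shuffle, e.g. before writing small output files.
df_fewer = df_filtered.coalesce(2)
print(df_more.rdd.getNumPartitions(), df_fewer.rdd.getNumPartitions())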
Visualizing: Job → Stages → Tasks
Let’s summarize how this works:
- You write code like df.groupBy().agg().show()
- Spark creates a Job for the action (show())
- Spark breaks the job into Stages at shuffle boundaries
- Each stage is split into Tasks, one per partition
- Tasks are distributed to executors for parallel execution
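You can watch this hierarchy in the Spark UI (by default at http://localhost:4040 while the application is running). Labeling an action makes its Job easy to find; a small sketch reusing df_grouped from above:
# Label the next Job so it stands out on the Jobs tab of the Spark UI.
spark.sparkContext.setJobDescription("Monthly average of 1958 passengers")
df_grouped.show()
# In the UI, the Jobs tab lists this Job, its detail page shows the Stages,
# and each Stage page lists the individual Tasks and the partitions they processed.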
Summary
- Job: Created when an action is triggered.
- Stage: A set of tasks between shuffles.
- Task: Executes the same code on a specific partition.
Understanding Jobs, Stages, and Tasks helps you optimize performance and debug Spark applications effectively. These concepts are also essential when interpreting the Spark UI and job execution plan.