Job, Stage, and Task in Apache Spark
To understand how Apache Spark executes your code behind the scenes, it’s essential to break down the three core components of its execution plan: Jobs, Stages, and Tasks. These are the building blocks of Spark's distributed processing engine.
What is a Job?
A Job is triggered when an action is called on a DataFrame or RDD, such as count(), collect(), or save(). It is the top-level unit of work submitted to Spark by the driver program.
Example:
from pyspark import SparkFiles
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("JobExample").getOrCreate()
# Spark can't read an HTTPS URL directly; fetch the file first (this path works in local mode)
spark.sparkContext.addFile("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
df = spark.read.csv("file://" + SparkFiles.get("airtravel.csv"), header=True)
df_filtered = df.filter(df["1958"] > 350)
df_filtered.show() # This is an action that triggers a Job
+-----+-----+-----+-----+
|Month|1958 |1959 |1960 |
+-----+-----+-----+-----+
| MAR | 362 | 406 | 419 |
| APR | 348 | 396 | 461 |
| MAY | 363 | 420 | 472 |
+-----+-----+-----+-----+
In this case, the show() method is the action. When it's called, Spark creates a Job to compute the filtered DataFrame and display it.
Question:
When does Spark actually run the code?
Answer:
Only when an action is called. Until then, all transformations like filter() are lazily evaluated and stored as a lineage graph.
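You can see this laziness for yourself: building a transformation and inspecting its plan runs nothing, and a Job only appears once an action is called. A minimal sketch, reusing the df from the example above (df_lazy is just an illustrative name):
# filter() only records a step in the lineage; no job runs yet
df_lazy = df.filter(df["1958"] > 350)
# explain() prints the query plan without executing it
df_lazy.explain()
# Only this action submits a Job to the cluster
print(df_lazy.count())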
What is a Stage?
A Stage is a logical group of tasks that can be executed together. Spark breaks each job into stages based on the presence of wide transformations like groupBy() or join(), which require data shuffling between partitions.
Stages are of two types:
- ShuffleMapStage: When Spark performs a shuffle (data is redistributed across nodes).
- ResultStage: When the final result needs to be returned to the driver.
Example:
from pyspark.sql.functions import avg
# Assume df is the same from earlier
df_grouped = df.groupBy("Month").agg(avg("1958").alias("avg_1958"))
df_grouped.show()
+-----+--------+
|Month|avg_1958|
+-----+--------+
|  JAN|   340.0|
|  FEB|   318.0|
|  MAR|   362.0|
+-----+--------+
Here, groupBy() introduces a shuffle, so the job is broken into at least two stages:
- Stage 1: Read and map records.
- Stage 2: Shuffle and reduce by key (month).
Question:
Why does Spark break jobs into stages?
Answer:
To isolate and manage shuffle boundaries. Operations without shuffling can run in the same stage. Shuffle operations need data to be grouped differently, so they form new stages.
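One way to see a shuffle boundary is in the physical plan: the Exchange operator marks where data gets redistributed, and it generally corresponds to a stage boundary at runtime. A small sketch, assuming the df and df_grouped DataFrames from above:
# Narrow transformations only: no Exchange, so a single stage
df.filter(df["1958"] > 350).explain()
# groupBy() adds an Exchange (shuffle), splitting the job into two stages
df_grouped.explain()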
What is a Task?
A Task is the smallest unit of execution in Spark. Each stage is divided into multiple tasks — one for each data partition. Tasks are distributed to worker nodes (executors).
Example:
If a DataFrame has 4 partitions and the job is broken into 2 stages, then each stage will have 4 tasks, one for each partition, so Spark will execute 8 tasks in total. (In practice, the number of partitions after a shuffle is controlled by spark.sql.shuffle.partitions, so the two stages may have different task counts.)
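Since each task processes one partition, you can predict a stage's task count by checking the partition count. A quick sketch, assuming the df from the earlier examples:
# One task per partition in the stage that scans this DataFrame
print(df.rdd.getNumPartitions())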
Question:
Does more partitions always mean faster execution?
Answer:
Not always. Too few partitions underutilize the cluster. Too many may add overhead. A balanced number of partitions ensures good performance.
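If you need to adjust the partition count, the usual knobs are repartition(), coalesce(), and the shuffle-partition setting. A hedged sketch, reusing df; the numbers are only illustrative:
# coalesce() reduces partitions without a full shuffle (too few partitions underutilizes the cluster)
df_few = df.coalesce(1)
# repartition() shuffles data into the requested number of partitions (too many adds overhead)
df_many = df.repartition(100)
# Post-shuffle stages use this setting (default 200); tune it to your data size
spark.conf.set("spark.sql.shuffle.partitions", "8")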
Visualizing: Job → Stages → Tasks
Let’s summarize how this works:
- You write code like df.groupBy().agg().show()
- Spark creates a Job for the action (show())
- Spark breaks the job into Stages at shuffle boundaries
- Each stage is split into Tasks, one per partition
- Tasks are distributed to executors for parallel execution
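The same hierarchy is visible in the Spark UI while your application runs: the Jobs tab lists each job, and drilling into a job shows its stages and their tasks. A small sketch, assuming the local session and df from above (the job description string is just an illustrative label):
from pyspark.sql.functions import avg
# Address of the Spark UI for this application (default port 4040)
print(spark.sparkContext.uiWebUrl)
# Label the next job so it is easy to spot in the Jobs tab
spark.sparkContext.setJobDescription("groupBy + show example")
df.groupBy("Month").agg(avg("1958").alias("avg_1958")).show()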
Summary
- Job: Created when an action is triggered.
- Stage: A set of tasks between shuffles.
- Task: Executes the same code on a specific partition.
Understanding Jobs, Stages, and Tasks helps you optimize performance and debug Spark applications effectively. These concepts are also essential when interpreting the Spark UI and job execution plan.