Two of the most powerful concepts in Apache Spark's architecture are the DAG (Directed Acyclic Graph) and Lazy Evaluation. These features make Spark faster and more efficient compared to traditional big data tools like Hadoop MapReduce.
DAG stands for Directed Acyclic Graph. It is a graph where:
Directed – every edge points in one direction, from one operation to the next
Acyclic – there are no cycles, so the flow of data never loops back on itself
Graph – nodes represent the RDDs and edges represent the transformations between them
In Spark, whenever you perform a sequence of transformations on data, Spark creates a DAG behind the scenes. This DAG is a blueprint of all operations — like a pipeline that defines what steps are needed to reach the final result.
Let’s consider a simple word count example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCountDAG").getOrCreate()
sc = spark.sparkContext
# Load data
lines = sc.parallelize(["hello world", "hello spark", "big data big deal"])
# Transformations
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda a, b: a + b)
# Action
print(wordCounts.collect())
[('hello', 2), ('world', 1), ('spark', 1), ('big', 2), ('data', 1), ('deal', 1)]
Here, Spark creates a DAG with the following transformations:
flatMap – Split each line into words
map – Map each word to a (word, 1) pair
reduceByKey – Combine counts of each word
The DAG tracks this sequence and waits until an action (like collect()) is triggered to execute these steps.
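You can peek at this recorded plan yourself. As a rough sketch reusing the wordCounts RDD from the example above, RDD.toDebugString() prints the lineage Spark has built so far, without running any job:
# Print the lineage recorded for wordCounts (no job is triggered by this call).
# In recent PySpark versions toDebugString() returns bytes, hence the decode().
print(wordCounts.toDebugString().decode("utf-8"))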
In Spark, transformations like map() or filter() are lazy, meaning they are not executed immediately. Spark simply records them in the DAG. Only when an action like collect() or count() is called does Spark analyze the DAG and execute the necessary operations.
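Here is a small sketch of that behavior (the names big and squared are just illustrative): the transformation returns instantly because nothing is computed yet, and the action at the end pulls only the few elements it actually needs.
big = sc.parallelize(range(1_000_000))
squared = big.map(lambda x: x * x)   # transformation: only recorded in the DAG, returns instantly
print(squared.take(5))               # action: triggers the job, computes just enough for 5 results -> [0, 1, 4, 9, 16]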
Why does Spark wait instead of running each transformation immediately?
Waiting allows Spark to optimize the execution plan. By knowing the entire chain of operations, it can group them into stages and minimize data shuffling, which improves performance.
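As a rough illustration with a made-up logs dataset, the narrow transformations below are pipelined together into the same stage, while the single reduceByKey is the only step that shuffles data across the cluster:
logs = sc.parallelize(["a", "b", "a", "c", "a", "b"])
result = (logs.map(lambda x: (x, 1))              # narrow: stays within each partition
              .reduceByKey(lambda a, b: a + b)    # wide: the only shuffle in this job
              .filter(lambda kv: kv[1] > 1))      # narrow: pipelined after the shuffle
print(result.collect())   # e.g. [('a', 3), ('b', 2)] (order may vary)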
Think of it like planning a road trip. You don’t drive to one place, then decide the next. Instead, you map the entire trip and find the best route before starting. Spark does the same — it builds the full DAG (the travel plan), and then executes it efficiently when needed.
Suppose you want to bake a cake. You don't start baking immediately. You first gather ingredients, check the recipe, and plan the steps. Only after everything is in place do you begin the actual baking. This is lazy evaluation — plan first, execute when needed.
data = sc.parallelize([1, 2, 3, 4, 5, 6])
# Transformation (not executed yet)
even_numbers = data.filter(lambda x: x % 2 == 0)
# No output yet because it's lazy
# Now trigger the action
print(even_numbers.count())
3
Even though we wrote the filter function, it did not run until we called count(). That's lazy evaluation in action.
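One consequence worth knowing: because transformations are only recorded, even a buggy one raises no error until an action runs it. A small sketch using the same data RDD:
broken = data.map(lambda x: x // 0)   # bug recorded in the DAG, but no error yet
try:
    broken.collect()                  # the division by zero only surfaces here
except Exception as err:
    print("Failed when the action ran:", type(err).__name__)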
Spark’s DAG and lazy evaluation features make it powerful and efficient for processing big data. By understanding these two concepts, you unlock how Spark optimizes workflows and delivers faster results.