When working with RDDs in Apache Spark, there are two types of operations you can perform: Transformations and Actions. Understanding the difference between these two is fundamental for building efficient and correct Spark applications.
Transformations are operations that create a new RDD from an existing RDD. These are lazy operations, which means they do not execute immediately. Instead, they build up a logical execution plan. Nothing actually happens until an action is called.
map() – Applies a function to each element
filter() – Filters elements based on a condition
flatMap() – Flattens the results
distinct() – Removes duplicates
union() – Combines two RDDs

Example: map()
from pyspark import SparkContext
sc = SparkContext("local", "Transformations Example")
# Create RDD
data = sc.parallelize([1, 2, 3, 4])
# Transformation
squared = data.map(lambda x: x * x)
# This line does not trigger any computation yet
Until now, Spark hasn't actually done anything. It has just built the plan.
Why does Spark wait and not calculate right away?
This is due to Spark's lazy evaluation. It allows Spark to optimize the entire data processing workflow before execution, reducing memory and computation overhead.
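The same lazy behavior applies to the other transformations listed above. As a minimal sketch (reusing the sc and data variables from the example above; the words RDD is just an illustrative addition), none of the following lines triggers any computation either:

# Filter, flatMap, distinct, and union are also lazy transformations
evens = data.filter(lambda x: x % 2 == 0)             # keep only even numbers
words = sc.parallelize(["spark is fast", "spark is simple"])
tokens = words.flatMap(lambda line: line.split(" "))   # one element per word
unique_tokens = tokens.distinct()                      # remove duplicate words
combined = data.union(sc.parallelize([5, 6]))          # merge two RDDs
# Still no job has run; only the lineage has been recorded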
Actions are operations that trigger computation of RDDs and return a result to the driver program or write it to storage. Actions execute the transformations defined earlier.
collect() – Returns all elements to the driver
count() – Returns the number of elements
take(n) – Returns the first n elements
saveAsTextFile() – Saves the RDD to disk
reduce() – Combines elements using a function

Example: collect()
# Now trigger the computation with an action
result = squared.collect()
print(result)
[1, 4, 9, 16]
The collect() action fetched the squared values, which were computed at this point.
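The other actions listed above behave the same way, and each call triggers its own job. A small sketch, reusing the squared RDD from the example above (the comments show the expected results for this input):

print(squared.count())                     # 4  -> number of elements
print(squared.take(2))                     # [1, 4]  -> first two elements
print(squared.reduce(lambda a, b: a + b))  # 30 -> sum of the squared values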
If we call multiple transformations before an action, do they all run at once?
Yes. Spark constructs a Directed Acyclic Graph (DAG) of all transformations and runs them as one optimized job when an action is triggered.
Let's go through a real-world style example: filtering and processing user click counts.
click_data = sc.parallelize(["user1,5", "user2,3", "user3,8", "user1,2"])
# Split records and sum clicks per user
mapped = click_data.map(lambda x: (x.split(",")[0], int(x.split(",")[1])))
summed = mapped.reduceByKey(lambda a, b: a + b)
# Trigger action
output = summed.collect()
print(output)
[('user1', 7), ('user2', 3), ('user3', 8)]
This example shows how transformations (map, reduceByKey) are composed and evaluated together when an action (collect) is called.
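If the result is too large to bring back to the driver, saveAsTextFile() can write it to storage instead. A hedged sketch, where "user_click_totals" is just an example output directory (Spark creates it and writes one part file per partition inside it):

# Persist the per-user totals to disk instead of collecting them
summed.saveAsTextFile("user_click_totals")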
| Aspect | Transformations | Actions |
|---|---|---|
| Execution | Lazy (executed only when an action is called) | Triggers execution |
| Return Value | Returns a new RDD | Returns a final value/result |
| Examples | map(), filter(), flatMap() | collect(), count(), saveAsTextFile() |
Understanding the difference between transformations and actions is crucial for writing efficient Spark programs. Transformations build up the computation logic, and actions trigger execution. Always remember: transformations define what to do, and actions define when to do it.