Transformations vs Actions in Apache Spark
When working with RDDs in Apache Spark, there are two types of operations you can perform — Transformations and Actions. Understanding the difference between these two is fundamental for building efficient and correct Spark applications.
What are Transformations?
Transformations are operations that create a new RDD from an existing RDD. These are lazy operations, which means they do not execute immediately. Instead, they build up a logical execution plan. Nothing actually happens until an action is called.
Common Transformations:
map() – Applies a function to each element
filter() – Keeps only the elements that satisfy a condition
flatMap() – Applies a function to each element and flattens the results
distinct() – Removes duplicate elements
union() – Combines two RDDs
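Beyond map(), which the next example walks through, the other transformations follow the same pattern: each returns a new RDD and none of them triggers any computation. Here is a minimal, standalone sketch (it creates its own SparkContext; if you already have one running, reuse it instead, since only one can be active per session):
from pyspark import SparkContext
sc = SparkContext("local", "Transformations Sketch")
numbers = sc.parallelize([1, 2, 2, 3])
lines = sc.parallelize(["spark is fast", "spark is lazy"])
# filter(): keep only the elements that satisfy a condition
evens = numbers.filter(lambda x: x % 2 == 0)
# flatMap(): apply a function to each element, then flatten the results
words = lines.flatMap(lambda line: line.split(" "))
# distinct(): drop duplicate elements
unique_numbers = numbers.distinct()
# union(): combine two RDDs into one
combined = numbers.union(sc.parallelize([4, 5]))
# All of these are lazy -- no Spark job has run yet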
Example: Transformation with map()
from pyspark import SparkContext
sc = SparkContext("local", "Transformations Example")
# Create RDD
data = sc.parallelize([1, 2, 3, 4])
# Transformation
squared = data.map(lambda x: x * x)
# This line does not trigger any computation yet
At this point, Spark hasn't actually done anything; it has only built up the execution plan.
Question:
Why does Spark wait and not calculate right away?
Answer:
This is due to Spark's lazy evaluation. It allows Spark to optimize the entire data processing workflow before execution, reducing memory and computation overhead.
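You can see the plan Spark has built without running it. Below is a small sketch reusing the squared RDD from the example above; toDebugString() only prints the recorded lineage and does not trigger a job:
# Inspect the lineage Spark has recorded -- still no computation
print(squared.toDebugString().decode())
# Prints the planned lineage, roughly a PythonRDD that depends on a
# ParallelCollectionRDD; only an action will actually execute it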
What are Actions?
Actions are operations that trigger computation of RDDs and return a result to the driver program or write it to storage. Actions execute the transformations defined earlier.
Common Actions:
collect() – Returns all elements to the driver
count() – Returns the number of elements
take(n) – Returns the first n elements
saveAsTextFile() – Saves the RDD to disk
reduce() – Combines elements using a function
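As a quick sketch, here is how the remaining actions from the list behave, reusing the squared RDD built earlier (the output path is just a placeholder):
# Each of these triggers a full job over the RDD lineage
print(squared.count())                      # 4  (number of elements)
print(squared.take(2))                      # [1, 4]  (first two elements)
print(squared.reduce(lambda a, b: a + b))   # 30  (sum of the squares)
# Writes one part file per partition; fails if the directory already exists
squared.saveAsTextFile("output/squares")    # placeholder path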
Example: Action with collect()
# Now trigger the computation with an action
result = squared.collect()
print(result)
[1, 4, 9, 16]
The collect() action triggered the computation: the squared values were only calculated at this point and returned to the driver.
Question:
If we call multiple transformations before an action, do they all run at once?
Answer:
Yes. Spark constructs a Directed Acyclic Graph (DAG) of all transformations and runs them as one optimized job when an action is triggered.
Combining Transformations and Actions
Let's go through a real-world style example: parsing click records and summing clicks per user.
click_data = sc.parallelize(["user1,5", "user2,3", "user3,8", "user1,2"])
# Split records and sum clicks per user
mapped = click_data.map(lambda x: (x.split(",")[0], int(x.split(",")[1])))
summed = mapped.reduceByKey(lambda a, b: a + b)
# Trigger action
output = summed.collect()
print(output)
[('user1', 7), ('user2', 3), ('user3', 8)]
This example shows how transformations (map(), reduceByKey()) are composed and evaluated together only when an action (collect()) is called.
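Because every step before the action is lazy, the pipeline can be extended freely before it runs. As a hypothetical extension of the example above, adding a filter() still produces a single optimized job when collect() is called:
# Hypothetical extension: keep only users with more than 4 total clicks
heavy_users = summed.filter(lambda pair: pair[1] > 4)
# map -> reduceByKey -> filter are all evaluated by this one action
print(heavy_users.collect())   # e.g. [('user1', 7), ('user3', 8)]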
Key Differences
| Aspect | Transformations | Actions |
| --- | --- | --- |
| Execution | Lazy (executed only when an action is called) | Triggers execution immediately |
| Return Value | Returns a new RDD | Returns a final value/result to the driver (or writes to storage) |
| Examples | map(), filter(), flatMap() | collect(), count(), saveAsTextFile() |
Summary
Understanding the difference between transformations and actions is crucial for writing efficient Spark programs. Transformations build up the computation logic, and actions trigger execution. Always remember: transformations define what to do, and actions define when to do it.