Transformations vs Actions in Apache Spark

When working with RDDs in Apache Spark, there are two types of operations you can perform — Transformations and Actions. Understanding the difference between these two is fundamental for building efficient and correct Spark applications.

What are Transformations?

Transformations are operations that create a new RDD from an existing RDD. These are lazy operations, which means they do not execute immediately. Instead, they build up a logical execution plan. Nothing actually happens until an action is called.

Common Transformations:

- map() – applies a function to each element and returns a new RDD
- filter() – keeps only the elements that satisfy a predicate
- flatMap() – maps each element to zero or more elements and flattens the result
- reduceByKey() – combines values that share the same key

Example: Transformation with map()


from pyspark import SparkContext

sc = SparkContext("local", "Transformations Example")

# Create an RDD from a Python list
data = sc.parallelize([1, 2, 3, 4])

# Transformation: map() records the squaring step but does not trigger any computation yet
squared = data.map(lambda x: x * x)

At this point, Spark hasn't actually computed anything; it has only recorded the plan.
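To verify that only a plan exists so far, you can ask the RDD to describe its lineage. The snippet below is a minimal sketch that reuses the squared RDD from above; toDebugString() prints the chain of planned RDDs without running any computation (recent PySpark versions return it as a bytes string).


# Inspect the lineage (the recorded plan) without triggering any computation
print(squared.toDebugString())
# Shows the chain of dependent RDDs, e.g. a PythonRDD built on top of a
# ParallelCollectionRDD -- still just a description, not a result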

Question:

Why does Spark wait and not calculate right away?

Answer:

This is due to Spark's lazy evaluation. It allows Spark to optimize the entire data processing workflow before execution, reducing memory and computation overhead.
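As an illustration of what lazy evaluation buys you (a minimal sketch reusing the sc context from above): because nothing runs until an action is called, an action such as take() lets Spark stop as soon as it has enough results instead of materializing the whole dataset.


# Lazy evaluation lets Spark do only as much work as the action needs
numbers = sc.parallelize(range(1, 1001), 8)     # RDD with 8 partitions
evens = numbers.filter(lambda x: x % 2 == 0)    # transformation: nothing runs yet

first_two = evens.take(2)                       # action: scans partitions only until 2 results are found
print(first_two)                                # [2, 4]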

What are Actions?

Actions are operations that trigger computation of RDDs and return a result to the driver program or write it to storage. Actions execute the transformations defined earlier.

Common Actions:

- collect() – returns all elements of the RDD to the driver
- count() – returns the number of elements
- take(n) – returns the first n elements
- saveAsTextFile() – writes the RDD to storage as text files

Example: Action with collect()


# Now trigger the computation with an action
result = squared.collect()
print(result)

Output:

[1, 4, 9, 16]

The collect() action triggers the computation and returns the squared values to the driver program; only now is the map() transformation actually executed.
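Keep in mind that collect() pulls the entire RDD back to the driver, which can exhaust driver memory on large datasets. As a minimal sketch, the same squared RDD can be inspected with lighter-weight actions:


# Actions that avoid bringing the whole dataset to the driver
print(squared.count())    # 4         -> number of elements
print(squared.take(2))    # [1, 4]    -> only the first two elements
print(squared.first())    # 1         -> just the first element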

Question:

If we call multiple transformations before an action, do they all run at once?

Answer:

Yes. Spark constructs a Directed Acyclic Graph (DAG) of all transformations and runs them as one optimized job when an action is triggered.
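As a small sketch of this behaviour, the three transformations below only record lineage; the single count() at the end runs the whole chain as one job.


# Three transformations are recorded, but nothing executes yet
lines = sc.parallelize(["spark makes big data simple", "spark is fast"])
tokens = lines.flatMap(lambda line: line.split(" "))   # transformation
longer = tokens.filter(lambda w: len(w) > 4)           # transformation
upper = longer.map(lambda w: w.upper())                # transformation

# One action triggers the entire chain as a single optimized job
print(upper.count())    # 4 -> SPARK, MAKES, SIMPLE, SPARK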

Combining Transformations and Actions

Let's go through a real-world style example: parsing user click records and aggregating the click counts per user.


click_data = sc.parallelize(["user1,5", "user2,3", "user3,8", "user1,2"])

# Split records and sum clicks per user
mapped = click_data.map(lambda x: (x.split(",")[0], int(x.split(",")[1])))
summed = mapped.reduceByKey(lambda a, b: a + b)

# Trigger action
output = summed.collect()
print(output)

Output:

[('user1', 7), ('user2', 3), ('user3', 8)]

This example shows how transformations (map, reduceByKey) are composed and evaluated together when an action (collect) is called.
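In a real pipeline the final step is usually an action that writes the result to storage rather than collecting it to the driver. The line below is a hedged sketch: saveAsTextFile() triggers the same DAG, and the output path is only a placeholder.


# saveAsTextFile() is also an action: it runs the DAG and writes one part file per partition
summed.saveAsTextFile("/tmp/user_click_counts")    # placeholder output directory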

Key Differences

Aspect       | Transformations                                | Actions
Execution    | Lazy (executed only when an action is called)  | Trigger execution immediately
Return Value | A new RDD                                      | A final value/result returned to the driver (or written to storage)
Examples     | map(), filter(), flatMap()                     | collect(), count(), saveAsTextFile()
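A quick way to see the Return Value difference in practice (a minimal sketch, assuming the same sc as above): a transformation hands back another RDD object, while an action hands back an ordinary Python value.


rdd = sc.parallelize([1, 2, 3])

print(type(rdd.filter(lambda x: x > 1)))   # an RDD subclass (e.g. PipelinedRDD) -> transformation
print(type(rdd.count()))                   # <class 'int'>                       -> action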

Summary

Understanding the difference between transformations and actions is crucial for writing efficient Spark programs. Transformations build up the computation logic, and actions trigger execution. Always remember: transformations define what to do, and actions define when to do it.


