Apache Spark Course

Module 12: Project – Real-World Data Pipeline

What is an RDD in Apache Spark?

RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Apache Spark, representing an immutable, distributed collection of objects that can be processed in parallel across a cluster.

RDDs are fault-tolerant, which means if a part of the data is lost due to a failure, Spark can automatically rebuild it using lineage information.

Why Use RDDs?

RDDs let you work with datasets that are too large for a single machine: the data is split into partitions, processed in parallel across the cluster, and rebuilt automatically from lineage information if a partition is lost.

Real-Life Analogy

Imagine you have a huge library of books and you want to count the number of times the word "Spark" appears in all books. Instead of one person doing all the work, you divide the books among 10 people and let each count their share. Once everyone is done, you collect and sum the results. This is how RDDs work — data is divided and processed in parallel.
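To make the analogy concrete, here is a minimal word-count sketch written with RDDs. The file name books.txt and the session setup are assumptions for illustration only; it also uses textFile(), which is covered in Example 2 below.

from pyspark.sql import SparkSession

# Assumed setup for illustration; any existing SparkSession works.
spark = SparkSession.builder.appName("Word Count Sketch").getOrCreate()
sc = spark.sparkContext

# "books.txt" is a hypothetical file: each line becomes one element of the RDD.
lines = sc.textFile("books.txt")

# Split each line into words, keep only "Spark", and count the matches in parallel.
spark_mentions = (lines
                  .flatMap(lambda line: line.split())
                  .filter(lambda word: word == "Spark")
                  .count())

print("Occurrences of 'Spark':", spark_mentions)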

Creating RDDs in PySpark

There are two main ways to create RDDs:

  1. By parallelizing an existing Python collection
  2. By loading external datasets (e.g., text files)

Example 1: Create RDD from a Python List


from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("RDD Example").getOrCreate()

# Access SparkContext from SparkSession
sc = spark.sparkContext

# Create RDD from list
data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)

# Perform an action (e.g., sum)
total = rdd.sum()
print("Sum of elements:", total)
    
Sum of elements: 150
    

Question:

Why use sc.parallelize() instead of just working with the Python list?

Answer:

Using parallelize() distributes the data across Spark worker nodes, enabling parallel computation on very large datasets. A plain Python list is processed on a single machine in a single process, so it cannot scale to big data workloads.
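As a rough illustration of that distribution, you can ask an RDD how many partitions it was split into and inspect what landed in each one. The partition count requested below is arbitrary; the actual split depends on your cluster and configuration.

# Explicitly request 3 partitions (an arbitrary number, for illustration only).
rdd3 = sc.parallelize([10, 20, 30, 40, 50], numSlices=3)

print("Number of partitions:", rdd3.getNumPartitions())

# glom() collects the elements of each partition into a list so we can see the layout.
print("Elements per partition:", rdd3.glom().collect())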

Example 2: Create RDD from a Text File


# Load RDD from external text file (each line becomes one element in RDD)
rdd_file = sc.textFile("sample.txt")

# Show first 5 lines
for line in rdd_file.take(5):
    print(line)

Output:

Line 1 of the file
Line 2 of the file
Line 3 of the file
Line 4 of the file
Line 5 of the file

This method is used to read large datasets stored on HDFS, S3, or even your local machine.
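The call is the same for each storage system; only the path's URI scheme changes. The hosts, buckets, and file names below are placeholders, and reading from S3 additionally requires the Hadoop AWS connector and credentials to be configured.

# Local file system (placeholder path)
local_rdd = sc.textFile("file:///home/user/data/sample.txt")

# HDFS (placeholder namenode host, port, and path)
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/sample.txt")

# Amazon S3 (placeholder bucket; needs hadoop-aws and credentials configured)
s3_rdd = sc.textFile("s3a://my-bucket/data/sample.txt")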

RDD Transformations vs Actions

Transformations are operations that define a new RDD from an existing one, like map() or filter(). They are lazy, meaning they don’t compute anything until an action is called.

Actions like count() or collect() trigger the actual computation and return results.

Example 3: Filter and Transform RDD


# Filter even numbers and square them
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
squared_rdd = filtered_rdd.map(lambda x: x * x)

print("Squared even numbers:", squared_rdd.collect())
    
Squared even numbers: [400, 1600]
    

Question:

Why didn’t Spark run the filter() and map() immediately?

Answer:

Because transformations are lazy. Spark waits until you call an action like collect() to optimize and execute the entire lineage graph.
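You can observe this for yourself: building the transformations returns instantly, and toDebugString() prints the lineage graph Spark will execute once an action runs. This is a small sketch that continues from the rdd created in Example 1.

# Transformations only build the plan; no data is touched yet.
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# toDebugString() returns the lineage (the chain of RDDs Spark can recompute from).
print(squares.toDebugString().decode("utf-8"))

# Only this action triggers the actual computation.
print("Result:", squares.collect())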

Key Properties of RDD

  1. Resilient – lost partitions can be recomputed from lineage information.
  2. Distributed – data is split into partitions spread across the cluster.
  3. Immutable – transformations never modify an RDD; they always produce a new one.
  4. Lazily evaluated – transformations run only when an action is triggered.
  5. In-memory – intermediate results can be cached in memory for fast reuse.

When Should You Use RDDs?

Use RDDs when you need:

  1. Low-level control over how data is partitioned and transformed.
  2. To process unstructured data, such as raw text or logs, that does not fit a tabular schema.
  3. Fine-grained transformations that the higher-level DataFrame API does not express easily.

Summary

RDDs are the backbone of Apache Spark. They give you control over distributed data processing while ensuring fault tolerance and parallelism. For most beginners, learning to create, transform, and act on RDDs is the first step in mastering Spark.


