What is an RDD in Apache Spark?
RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Apache Spark, representing an immutable, distributed collection of objects that can be processed in parallel across a cluster.
RDDs are fault-tolerant, which means if a part of the data is lost due to a failure, Spark can automatically rebuild it using lineage information.
Why Use RDDs?
- They enable parallel processing of large datasets.
- They are fault-tolerant using lineage graphs.
- They support in-memory computation for faster performance (see the sketch after this list).
- They provide fine-grained control over data transformations.
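To make the fault-tolerance and in-memory points a bit more concrete, here is a minimal sketch (the app name and the numbers are placeholders; Example 1 below walks through the setup step by step):
from pyspark.sql import SparkSession
# Session setup (the app name is just a placeholder)
spark = SparkSession.builder.appName("RDD Basics").getOrCreate()
sc = spark.sparkContext
numbers = sc.parallelize(range(1, 101))         # distributed across the cluster
doubled = numbers.map(lambda x: x * 2).cache()  # keep the result in memory
print(doubled.count())          # the first action computes the RDD and caches it
print(doubled.sum())            # later actions reuse the in-memory copy
print(doubled.toDebugString())  # the lineage Spark uses to rebuild lost partitions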
Real-Life Analogy
Imagine you have a huge library of books and you want to count the number of times the word "Spark" appears in all books. Instead of one person doing all the work, you divide the books among 10 people and let each count their share. Once everyone is done, you collect and sum the results. This is how RDDs work — data is divided and processed in parallel.
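A rough, hypothetical PySpark version of the same idea could look like the sketch below, with a tiny in-memory list standing in for the library of books:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Word Count Sketch").getOrCreate()
sc = spark.sparkContext
# Each element stands in for one line of text from the "library" (made-up sample lines)
lines = sc.parallelize([
    "Apache Spark is fast",
    "Spark keeps data in memory",
    "RDDs are the core of Spark",
])
# Every worker counts "Spark" in its own share of the lines; count() adds up the partial results
occurrences = lines.flatMap(lambda line: line.split()).filter(lambda word: word == "Spark").count()
print("Occurrences of 'Spark':", occurrences)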
Creating RDDs in PySpark
There are mainly two ways to create RDDs:
- By parallelizing an existing Python collection
- By loading external datasets (e.g., text files)
Example 1: Create RDD from a Python List
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("RDD Example").getOrCreate()
# Access SparkContext from SparkSession
sc = spark.sparkContext
# Create RDD from list
data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)
# Perform an action (e.g., sum)
total = rdd.sum()
print("Sum of elements:", total)
Sum of elements: 150
Question:
Why use sc.parallelize() instead of just working with the Python list?
Answer:
Using parallelize() distributes the data across Spark worker nodes, enabling parallel computation on very large datasets. A regular Python list, by contrast, is processed in a single thread on one machine and cannot scale for big data processing.
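If you want to see that distribution, you can ask the RDD from Example 1 how it was split. The exact number of partitions depends on your cluster and its default parallelism:
# Inspect the distribution created by parallelize()
print("Number of partitions:", rdd.getNumPartitions())
print("Elements per partition:", rdd.glom().collect())  # one list per partition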
Example 2: Create RDD from a Text File
# Load RDD from external text file (each line becomes one element in RDD)
rdd_file = sc.textFile("sample.txt")
# Show first 5 lines
for line in rdd_file.take(5):
    print(line)
Line 1 of the file
Line 2 of the file
Line 3 of the file
Line 4 of the file
Line 5 of the file
This method is used to read large datasets stored on HDFS, S3, or even your local machine.
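Only the path changes between storage systems. The paths below are placeholders, and the S3 line assumes the Hadoop s3a connector is configured:
local_rdd = sc.textFile("file:///path/to/sample.txt")           # local file system
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/sample.txt")  # HDFS (hypothetical host and path)
s3_rdd = sc.textFile("s3a://my-bucket/data/sample.txt")         # Amazon S3 via the s3a connector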
RDD Transformations vs Actions
Transformations are operations that define a new RDD from an existing one, like map() or filter(). They are lazy, meaning they don’t compute anything until an action is called. Actions like count() or collect() trigger the actual computation and return results.
Example 3: Filter and Transform RDD
# Create a second RDD with a mix of odd and even numbers (sample values),
# keep only the even ones, and square them
numbers_rdd = sc.parallelize([15, 20, 33, 40, 55])
filtered_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)
squared_rdd = filtered_rdd.map(lambda x: x * x)
print("Squared even numbers:", squared_rdd.collect())
Squared even numbers: [400, 1600]
Question:
Why didn’t Spark run the filter() and map() immediately?
Answer:
Because transformations are lazy. Spark waits until you call an action like collect() to optimize and execute the entire lineage graph.
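One quick way to see this laziness, reusing the rdd from Example 1, is to define a transformation that would fail and notice that nothing happens until an action runs:
# Nothing runs here: the bad division is only recorded in the lineage graph
lazy_rdd = rdd.map(lambda x: x / 0)
print("Transformation defined, but no job has run yet")
# Only an action triggers execution, and with it the failure:
# lazy_rdd.collect()  # uncommenting this raises a division-by-zero error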
Key Properties of RDD
- Immutable: Once created, you can't change an RDD — only transform it into a new one (see the sketch after this list).
- Distributed: Data is split and processed across multiple nodes.
- Lazy Evaluation: Transformations are not executed until an action is triggered.
- Fault Tolerant: Lost data can be recomputed using lineage (the record of how the RDD was built).
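The immutability point is easy to check against the rdd from Example 1: transforming it returns a new RDD and leaves the original untouched.
# Transformations return a new RDD; the original is never modified
incremented = rdd.map(lambda x: x + 1)
print("Original RDD:", rdd.collect())      # [10, 20, 30, 40, 50]
print("New RDD:", incremented.collect())   # [11, 21, 31, 41, 51]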
When Should You Use RDDs?
Use RDDs when you need:
- Fine-grained control over low-level data transformations
- Custom processing logic that is hard to express using DataFrames
- To work with unstructured data (e.g., logs), as shown in the sketch below
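As a small, hypothetical sketch of that last point, the log lines below are made up; in practice they would come from sc.textFile():
logs = sc.parallelize([
    "2024-05-01 10:00:01 INFO Starting job",
    "2024-05-01 10:00:02 ERROR Disk full on node-3",
    "2024-05-01 10:00:05 WARN Slow response from node-7",
    "2024-05-01 10:00:09 ERROR Task failed on node-3",
])
# Plain RDD operations give line-level control over the unstructured text
errors = logs.filter(lambda line: " ERROR " in line).map(lambda line: line.split(" ", 3)[-1])
print("Error messages:", errors.collect())  # ['Disk full on node-3', 'Task failed on node-3']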
Summary
RDDs are the backbone of Apache Spark. They give you control over distributed data processing while ensuring fault tolerance and parallelism. For most beginners, learning to create, transform, and act on RDDs is the first step in mastering Spark.