What is an RDD in Apache Spark?
RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Apache Spark, representing an immutable, distributed collection of objects that can be processed in parallel across a cluster.
RDDs are fault-tolerant, which means if a part of the data is lost due to a failure, Spark can automatically rebuild it using lineage information.
Why Use RDDs?
- They enable parallel processing of large datasets.
- They are fault-tolerant using lineage graphs.
- They support in-memory computation for faster performance (see the sketch after this list).
- They provide fine-grained control over data transformations.
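To make the fault-tolerance and in-memory points a bit more concrete, here is a minimal sketch (the app name and the numbers are placeholders; Example 1 below walks through the setup step by step):
from pyspark.sql import SparkSession
# Session setup (the app name is just a placeholder)
spark = SparkSession.builder.appName("RDD Basics").getOrCreate()
sc = spark.sparkContext
numbers = sc.parallelize(range(1, 101))         # distributed across the cluster
doubled = numbers.map(lambda x: x * 2).cache()  # keep the result in memory
print(doubled.count())          # the first action computes the RDD and caches it
print(doubled.sum())            # later actions reuse the in-memory copy
print(doubled.toDebugString())  # the lineage Spark uses to rebuild lost partitions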
Real-Life Analogy
Imagine you have a huge library of books and you want to count the number of times the word "Spark" appears in all books. Instead of one person doing all the work, you divide the books among 10 people and let each count their share. Once everyone is done, you collect and sum the results. This is how RDDs work — data is divided and processed in parallel.
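A rough, hypothetical PySpark version of the same idea could look like the sketch below, with a tiny in-memory list standing in for the library of books:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Word Count Sketch").getOrCreate()
sc = spark.sparkContext
# Each element stands in for one line of text from the "library" (made-up sample lines)
lines = sc.parallelize([
    "Apache Spark is fast",
    "Spark keeps data in memory",
    "RDDs are the core of Spark",
])
# Every worker counts "Spark" in its own share of the lines; count() adds up the partial results
occurrences = lines.flatMap(lambda line: line.split()).filter(lambda word: word == "Spark").count()
print("Occurrences of 'Spark':", occurrences)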
Creating RDDs in PySpark
There are mainly two ways to create RDDs:
- By parallelizing an existing Python collection
- By loading external datasets (e.g., text files)
Example 1: Create RDD from a Python List
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("RDD Example").getOrCreate()
# Access SparkContext from SparkSession
sc = spark.sparkContext
# Create RDD from list
data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)
# Perform an action (e.g., sum)
total = rdd.sum()
print("Sum of elements:", total)
Sum of elements: 150
Question:
Why use sc.parallelize() instead of just working with the Python list?
Answer:
Using parallelize() distributes the data across Spark worker nodes, enabling parallel computation on very large datasets. A regular Python list, by contrast, is processed in a single thread on one machine and cannot scale for big data processing.
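If you want to see that distribution, you can ask the RDD from Example 1 how it was split. The exact number of partitions depends on your cluster and its default parallelism:
# Inspect the distribution created by parallelize()
print("Number of partitions:", rdd.getNumPartitions())
print("Elements per partition:", rdd.glom().collect())  # one list per partition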
Example 2: Create RDD from a Text File
# Load RDD from external text file (each line becomes one element in RDD)
rdd_file = sc.textFile("sample.txt")
# Show first 5 lines
for line in rdd_file.take(5):
    print(line)
Line 1 of the file
Line 2 of the file
Line 3 of the file
Line 4 of the file
Line 5 of the file
This method is used to read large datasets stored on HDFS, S3, or even your local machine.
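Only the path changes between storage systems. The paths below are placeholders, and the S3 line assumes the Hadoop s3a connector is configured:
local_rdd = sc.textFile("file:///path/to/sample.txt")           # local file system
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/sample.txt")  # HDFS (hypothetical host and path)
s3_rdd = sc.textFile("s3a://my-bucket/data/sample.txt")         # Amazon S3 via the s3a connector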
RDD Transformations vs Actions
Transformations are operations that define a new RDD from an existing one, like map() or filter(). They are lazy, meaning they don’t compute anything until an action is called. Actions like count() or collect() trigger the actual computation and return results.
Example 3: Filter and Transform RDD
# Create a second RDD with a mix of odd and even numbers (sample values),
# keep only the even ones, and square them
numbers_rdd = sc.parallelize([15, 20, 33, 40, 55])
filtered_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)
squared_rdd = filtered_rdd.map(lambda x: x * x)
print("Squared even numbers:", squared_rdd.collect())
Squared even numbers: [400, 1600]
Question:
Why didn’t Spark run the filter() and map() immediately?
Answer:
Because transformations are lazy. Spark waits until you call an action like collect() to optimize and execute the entire lineage graph.
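One quick way to see this laziness, reusing the rdd from Example 1, is to define a transformation that would fail and notice that nothing happens until an action runs:
# Nothing runs here: the bad division is only recorded in the lineage graph
lazy_rdd = rdd.map(lambda x: x / 0)
print("Transformation defined, but no job has run yet")
# Only an action triggers execution, and with it the failure:
# lazy_rdd.collect()  # uncommenting this raises a division-by-zero error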
Key Properties of RDD
- Immutable: Once created, you can't change an RDD — only transform it into a new one (see the sketch after this list).
- Distributed: Data is split and processed across multiple nodes.
- Lazy Evaluation: Transformations are not executed until an action is triggered.
- Fault Tolerant: Lost data can be recomputed using lineage (the record of how the RDD was built).
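The immutability point is easy to check against the rdd from Example 1: transforming it returns a new RDD and leaves the original untouched.
# Transformations return a new RDD; the original is never modified
incremented = rdd.map(lambda x: x + 1)
print("Original RDD:", rdd.collect())      # [10, 20, 30, 40, 50]
print("New RDD:", incremented.collect())   # [11, 21, 31, 41, 51]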
When Should You Use RDDs?
Use RDDs when you need:
- Fine-grained control over low-level data transformations
- Custom processing logic that is hard to express using DataFrames
- To work with unstructured data (e.g., logs), as shown in the sketch below
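As a small, hypothetical sketch of that last point, the log lines below are made up; in practice they would come from sc.textFile():
logs = sc.parallelize([
    "2024-05-01 10:00:01 INFO Starting job",
    "2024-05-01 10:00:02 ERROR Disk full on node-3",
    "2024-05-01 10:00:05 WARN Slow response from node-7",
    "2024-05-01 10:00:09 ERROR Task failed on node-3",
])
# Plain RDD operations give line-level control over the unstructured text
errors = logs.filter(lambda line: " ERROR " in line).map(lambda line: line.split(" ", 3)[-1])
print("Error messages:", errors.collect())  # ['Disk full on node-3', 'Task failed on node-3']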
Summary
RDDs are the backbone of Apache Spark. They give you control over distributed data processing while ensuring fault tolerance and parallelism. For most beginners, learning to create, transform, and act on RDDs is the first step in mastering Spark.