RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Apache Spark, representing an immutable, distributed collection of objects that can be processed in parallel across a cluster.
RDDs are fault-tolerant, which means if a part of the data is lost due to a failure, Spark can automatically rebuild it using lineage information.
Imagine you have a huge library of books and you want to count the number of times the word "Spark" appears in all books. Instead of one person doing all the work, you divide the books among 10 people and let each count their share. Once everyone is done, you collect and sum the results. This is how RDDs work — data is divided and processed in parallel.
There are two main ways to create RDDs: from an existing collection in your program using parallelize(), and from an external file using textFile(). First, let's create an RDD from a Python list.
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("RDD Example").getOrCreate()
# Access SparkContext from SparkSession
sc = spark.sparkContext
# Create RDD from list
data = [10, 20, 30, 40, 50]
rdd = sc.parallelize(data)
# Perform an action (e.g., sum)
total = rdd.sum()
print("Sum of elements:", total)
Sum of elements: 150
Why use sc.parallelize() instead of just working with the Python list? Using parallelize() distributes the data across Spark worker nodes, enabling parallel computation on very large datasets. A regular Python list runs in a single thread and cannot scale for big data processing.
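You can see the distribution for yourself by asking Spark how many partitions an RDD has. The snippet below is a small sketch; the choice of 3 partitions and the exact grouping of elements are illustrative, not part of the original example.
# Split the same list into 3 partitions and inspect the distribution
rdd_partitioned = sc.parallelize(data, 3)  # numSlices=3 is an arbitrary choice
print("Number of partitions:", rdd_partitioned.getNumPartitions())
# glom() groups the elements of each partition into a list, e.g. [[10], [20, 30], [40, 50]]
print("Elements per partition:", rdd_partitioned.glom().collect())
The second way to create an RDD is to load data from an external file.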
# Load RDD from external text file (each line becomes one element in RDD)
rdd_file = sc.textFile("sample.txt")
# Show first 5 lines
for line in rdd_file.take(5):
    print(line)
Line 1 of the file
Line 2 of the file
Line 3 of the file
Line 4 of the file
Line 5 of the file
This method is used to read large datasets stored on HDFS, S3, or even your local machine.
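The path passed to textFile() can point at different storage systems. The paths below are placeholders (hypothetical host, port, and bucket names), and reading from HDFS or S3 assumes the cluster has the corresponding connectors configured.
# Local file system (explicit file:// scheme)
rdd_local = sc.textFile("file:///path/to/sample.txt")
# HDFS (placeholder namenode host and port)
rdd_hdfs = sc.textFile("hdfs://namenode:9000/data/sample.txt")
# Amazon S3 via the s3a connector (placeholder bucket name)
rdd_s3 = sc.textFile("s3a://my-bucket/data/sample.txt")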
Transformations are operations that define a new RDD from an existing one, like map() or filter(). They are lazy, meaning they don't compute anything until an action is called. Actions like count() or collect() trigger the actual computation and return results.
# Filter even numbers and square them
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
squared_rdd = filtered_rdd.map(lambda x: x * x)
print("Squared even numbers:", squared_rdd.collect())
Squared even numbers: [400, 1600]
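Besides collect(), a few other common actions can be applied to the same rdd created earlier. This is a small sketch; each call triggers its own Spark job.
print("Number of elements:", rdd.count())  # 5
print("First element:", rdd.first())       # 10
print("Two largest values:", rdd.top(2))   # [50, 40]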
Why didn't Spark run the filter() and map() immediately? Because transformations are lazy. Spark waits until you call an action like collect() to optimize and execute the entire lineage graph.
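You can inspect that lineage directly with toDebugString(), which shows the chain of transformations Spark has recorded for an RDD. In PySpark this may return bytes, so the sketch below decodes it defensively.
# Show the recorded lineage for squared_rdd (parallelize -> filter -> map)
lineage = squared_rdd.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)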
Use RDDs when you need fine-grained control over distributed data, fault tolerance through lineage, and parallel processing across a cluster.
RDDs are the backbone of Apache Spark. They give you control over distributed data processing while ensuring fault tolerance and parallelism. For most beginners, learning to create, transform, and act on RDDs is the first step in mastering Spark.