RDD (Resilient Distributed Dataset) is the most fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel. To start working with Spark, you must learn how to create RDDs.
You can create RDDs in two main ways:
If you already have a Python list or sequence, you can use the sc.parallelize() method to create an RDD.
from pyspark import SparkContext
# Create a SparkContext
sc = SparkContext("local", "RDD Example")
# Create a list
data = [1, 2, 3, 4, 5]
# Convert the list into an RDD
rdd = sc.parallelize(data)
# Collect the RDD elements
print("RDD contents:", rdd.collect())
Output:
RDD contents: [1, 2, 3, 4, 5]
Why do we need sc.parallelize() to create RDDs from Python lists?
Because PySpark needs to distribute the data across the cluster. parallelize() tells Spark to split the data into partitions that can be processed in parallel.
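A minimal sketch of how to see and control that partitioning, reusing the sc created above; numSlices and getNumPartitions() are standard PySpark APIs:

# Ask Spark to split the data into 4 partitions
rdd = sc.parallelize(range(10), numSlices=4)

# Check how many partitions were created
print("Partitions:", rdd.getNumPartitions())

# Inspect which elements landed in which partition
print("Layout:", rdd.glom().collect())

The exact element layout depends on how Spark slices the input, but the partition count here will be 4.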
names = ["Alice", "Bob", "Charlie", "David"]
names_rdd = sc.parallelize(names)
# Print each name
print("Names in RDD:", names_rdd.collect())
Output:
Names in RDD: ['Alice', 'Bob', 'Charlie', 'David']
Spark can also create RDDs by reading data from a file stored on your local system or in distributed file systems like HDFS or S3.
Assume you have a file sample.txt with the following content:

Hello Spark
Welcome to Big Data
RDDs are awesome
# Load text file as RDD
text_rdd = sc.textFile("sample.txt")
# Display the contents
print("Lines in file:", text_rdd.collect())
Output:
Lines in file: ['Hello Spark', 'Welcome to Big Data', 'RDDs are awesome']
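The same sc.textFile() call also reads from distributed storage; only the path scheme changes. A hedged sketch, where the namenode address and bucket name below are placeholders for illustration, not real endpoints:

# Read a file from HDFS (hypothetical namenode address and path)
hdfs_rdd = sc.textFile("hdfs://namenode:9000/data/sample.txt")

# Read a file from S3 (hypothetical bucket; requires the S3A connector to be configured)
s3_rdd = sc.textFile("s3a://my-bucket/data/sample.txt")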
What if the file is large and cannot fit into memory?
Spark processes data in a distributed way. The file is split into chunks, and each chunk is processed by different executors in parallel, making it scalable even for huge files.
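You can also hint how finely the file should be split using textFile()'s minPartitions argument. A small sketch, reusing sample.txt from above:

# Ask for at least 4 input partitions; Spark treats this as a minimum, not an exact count
text_rdd = sc.textFile("sample.txt", minPartitions=4)
print("Partitions:", text_rdd.getNumPartitions())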
# Split lines into words
words_rdd = text_rdd.flatMap(lambda line: line.split(" "))
# Show the words, then count them
print("Words:", words_rdd.collect())
print("Total word count:", words_rdd.count())
Output:
Words: ['Hello', 'Spark', 'Welcome', 'to', 'Big', 'Data', 'RDDs', 'are', 'awesome']
Total word count: 9
You have now seen both transformations (like map and flatMap) and actions (like collect and count). As a beginner, remember: RDDs are your entry point to working with distributed data in Apache Spark. Practice creating RDDs from both small collections and large files to get comfortable with their behavior.