Selecting, Filtering, and Transforming Data in Spark DataFrames
Working with large datasets often involves extracting relevant data by selecting specific columns, filtering rows based on conditions, and applying transformations. In Apache Spark, the DataFrame
API makes these operations easy and efficient, even for beginners.
Selecting Specific Columns
When you load a DataFrame, it may contain many columns, but often you only need a few for your analysis. You can use the select() method to choose specific columns.
Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SelectExample").getOrCreate()
# Sample data
data = [
("Alice", 25, "New York"),
("Bob", 30, "San Francisco"),
("Charlie", 28, "London")
]
# Create DataFrame
df = spark.createDataFrame(data, ["name", "age", "city"])
# Select specific columns
df.select("name", "city").show()
+-------+-------------+
|   name|         city|
+-------+-------------+
|  Alice|     New York|
|    Bob|San Francisco|
|Charlie|       London|
+-------+-------------+
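Note that select() also accepts Column expressions, not just column names, so you can compute values while selecting. A minimal sketch (the alias age_next_year is just an illustrative name):
from pyspark.sql.functions import col
# Select an existing column alongside a derived expression
df.select(col("name"), (col("age") + 1).alias("age_next_year")).show()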
Question:
What happens if you try to select a column that doesn't exist?
Answer:
Spark raises an AnalysisException because the column cannot be resolved against the DataFrame's schema. Always make sure the column name you specify matches the schema (by default Spark resolves column names case-insensitively, controlled by the spark.sql.caseSensitive setting).
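As an illustration, selecting a misspelled column name fails when Spark analyzes the plan. A hedged sketch of catching that error (the typo is deliberate; in recent PySpark versions the exception is also available as pyspark.errors.AnalysisException):
from pyspark.sql.utils import AnalysisException
try:
    df.select("nmae").show()  # misspelled column name
except AnalysisException as err:
    print("Column could not be resolved:", err)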
Filtering Rows Based on Conditions
To work only with relevant data, we often need to filter rows that meet specific conditions using the filter() or where() methods.
Example
Let’s say we want to filter people whose age is greater than 26.
# Filter rows where age > 26
df.filter(df.age > 26).show()
+-------+---+-------------+
|   name|age|         city|
+-------+---+-------------+
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
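Because where() is simply an alias for filter(), the same condition can be written either way:
# Equivalent to the filter() call above
df.where(df.age > 26).show()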
You can also combine multiple conditions using & (AND), | (OR), and ~ (NOT).
Example: Filtering with Multiple Conditions
# Filter people older than 26 and living in London
df.filter((df.age > 26) & (df.city == "London")).show()
+-------+---+------+
|   name|age|  city|
+-------+---+------+
|Charlie| 28|London|
+-------+---+------+
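The other two operators work the same way. A short sketch combining | and ~ (the condition itself is just an illustration):
# People under 26 OR people who do NOT live in London
df.filter((df.age < 26) | ~(df.city == "London")).show()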
Question:
Why do we use parentheses around each condition?
Answer:
Because in Python the operators & and | bind more tightly than comparison operators such as > and ==. Without parentheses, df.age > 26 & df.city == "London" would be grouped as df.age > (26 & df.city) == "London" and fail. Parentheses ensure each comparison is evaluated first, giving the correct evaluation order.
Transforming Data
Transformation refers to modifying or deriving new data columns using existing ones. Common transformations include:
- Renaming columns
- Creating new columns with calculations
- Changing column values using expressions
Example: Adding a New Column
Let’s add a new column called age_plus_5 by adding 5 to the existing age column.
from pyspark.sql.functions import col
df.withColumn("age_plus_5", col("age") + 5).show()
+-------+---+-------------+----------+
|   name|age|         city|age_plus_5|
+-------+---+-------------+----------+
|  Alice| 25|     New York|        30|
|    Bob| 30|San Francisco|        35|
|Charlie| 28|       London|        33|
+-------+---+-------------+----------+
Example: Renaming a Column
df.withColumnRenamed("city", "location").show()
+-------+---+-------------+
|   name|age|     location|
+-------+---+-------------+
|  Alice| 25|     New York|
|    Bob| 30|San Francisco|
|Charlie| 28|       London|
+-------+---+-------------+
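The transformation list above also mentions changing column values using expressions. A minimal sketch using when()/otherwise() to derive a label from an existing column (the column name age_group is an illustrative choice):
from pyspark.sql.functions import when, col
# Derive a new value from the age column using a conditional expression
df.withColumn("age_group", when(col("age") >= 28, "28+").otherwise("under 28")).show()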
Question:
Is the original DataFrame modified permanently when we use withColumn or withColumnRenamed?
Answer:
No. Spark DataFrames are immutable. Every transformation returns a new DataFrame. You must assign it to a variable if you want to store the changes.
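For example, to keep the renamed column you would assign the result to a variable:
# df itself is unchanged; df_renamed holds the new schema
df_renamed = df.withColumnRenamed("city", "location")
df_renamed.show()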
Summary
In this section, we learned how to:
- Select one or more columns from a DataFrame using select()
- Filter rows based on conditions using filter() or where()
- Create new columns or rename existing ones using withColumn() and withColumnRenamed()
These operations are essential for cleaning, preparing, and analyzing data before applying any machine learning or reporting logic. Mastering them is key to working effectively with Spark DataFrames.
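Putting the pieces together, these operations are typically chained into a single pipeline. A short sketch over the same sample data (the variable name result is just illustrative):
from pyspark.sql.functions import col
# Select, filter, and transform in one chained expression
result = (
    df.select("name", "age", "city")
      .filter(col("age") > 26)
      .withColumn("age_plus_5", col("age") + 5)
      .withColumnRenamed("city", "location")
)
result.show()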