Replacing, Filtering, and Grouping Data in PySpark
When working with large datasets, it’s common to encounter messy data. Apache Spark provides powerful DataFrame functions that help you clean and transform data at scale. In this tutorial, we’ll focus on three key operations: replacing values, filtering rows, and grouping data to derive insights.
Replacing Data in PySpark
Real-world data often contains missing, incorrect, or placeholder values. For example, we might replace nulls with sensible defaults, or swap cryptic codes for readable labels; both are shown below.
Example: Replace null values with default values
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ReplaceExample").getOrCreate()
data = [
("Alice", 34, "New York"),
("Bob", None, "Los Angeles"),
("Charlie", 29, None)
]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
# Replace nulls with default values
df_filled = df.fillna({"Age": 0, "City": "Unknown"})
df_filled.show()
+-------+---+-----------+
|   Name|Age|       City|
+-------+---+-----------+
|  Alice| 34|   New York|
|    Bob|  0|Los Angeles|
|Charlie| 29|    Unknown|
+-------+---+-----------+
Question:
Why do we replace nulls with values like 0 or "Unknown"?
Answer:
Leaving nulls in place causes problems downstream: aggregate functions silently skip them and comparisons against null never match, so rows can quietly disappear from results. Replacing nulls with explicit defaults keeps the data consistent and avoids errors in later processing.
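The same idea works for placeholder or coded values. Below is a minimal sketch, assuming a hypothetical Status column whose short codes we want to turn into readable labels; replace() swaps exact matches rather than filling nulls.
# Hypothetical data: a coded Status column
status_data = [("Alice", "A"), ("Bob", "I"), ("Charlie", "A")]
status_df = spark.createDataFrame(status_data, ["Name", "Status"])
# replace() substitutes exact values; here the codes become readable labels
status_labeled = status_df.replace({"A": "Active", "I": "Inactive"}, subset=["Status"])
status_labeled.show()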
Filtering Data in PySpark
Filtering means selecting only the rows that match certain conditions. This helps narrow down the data for analysis.
Example: Filter rows based on condition
# Filter people older than 30
filtered_df = df.filter(col("Age") > 30)
filtered_df.show()
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 34|New York|
+-----+---+--------+
Question:
What happens if the column contains nulls?
Answer:
In Spark, a comparison against null evaluates to null rather than false, and filter() keeps only rows where the condition is true, so rows with a null Age are silently dropped unless you handle them explicitly (see the sketch below).
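If you want rows with a missing Age to appear in the result rather than silently disappear, you have to handle them explicitly. Two common options, sketched on the DataFrame above:
# Option 1: keep null ages alongside the matches
df.filter((col("Age") > 30) | col("Age").isNull()).show()
# Option 2: fill nulls first, then filter on the cleaned column
df.fillna({"Age": 0}).filter(col("Age") > 30).show()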
Grouping Data in PySpark
Grouping is used to summarize or aggregate data based on one or more columns. For instance, you might want to calculate the average salary by department or the count of users by city.
Example: Group by City and calculate average age
from pyspark.sql.functions import avg
# Replace nulls before grouping
df_clean = df.fillna({"Age": 0, "City": "Unknown"})
# Group by City and compute average age
grouped_df = df_clean.groupBy("City").agg(avg("Age").alias("Average_Age"))
grouped_df.show()
+-----------+-----------+
|       City|Average_Age|
+-----------+-----------+
|   New York|       34.0|
|Los Angeles|        0.0|
|    Unknown|       29.0|
+-----------+-----------+
Question:
Why do we often replace nulls before grouping?
Answer:
If we don’t, groupBy() collects the null keys into their own group and aggregate functions such as avg() skip null values entirely, which can distort the results or make them harder to interpret.
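The same groupBy() pattern handles counts as well as averages. A quick sketch that counts how many people are in each city, using the cleaned DataFrame from above:
from pyspark.sql.functions import count
# Count rows per city on the cleaned data
city_counts = df_clean.groupBy("City").agg(count("*").alias("Num_People"))
city_counts.show()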
Combining Replace, Filter, and Group Operations
In real projects, we often use these operations together to clean, subset, and analyze data.
Example: Find average age of people older than 20 by city
# Step 1: Clean nulls
df_clean = df.fillna({"Age": 0, "City": "Unknown"})
# Step 2: Filter Age > 20
df_filtered = df_clean.filter(col("Age") > 20)
# Step 3: Group by City and calculate average
result = df_filtered.groupBy("City").agg(avg("Age").alias("Avg_Age"))
result.show()
+-----------+-------+
|       City|Avg_Age|
+-----------+-------+
|   New York|   34.0|
|    Unknown|   29.0|
+-----------+-------+
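Because each of these methods returns a new DataFrame, the three steps can also be chained into a single expression. The pipeline above, written as one chain:
result = (
    df.fillna({"Age": 0, "City": "Unknown"})
      .filter(col("Age") > 20)
      .groupBy("City")
      .agg(avg("Age").alias("Avg_Age"))
)
result.show()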
Summary
Replacing, filtering, and grouping are foundational operations in data transformation. They allow us to clean messy datasets, extract meaningful insights, and prepare data for deeper analysis or machine learning. Mastering these in PySpark will help you tackle large datasets with confidence.