When working with large datasets, it’s common to encounter messy data. Apache Spark provides powerful DataFrame functions that help you clean and transform data at scale. In this tutorial, we’ll focus on three key operations: replacing values, filtering rows, and grouping data to derive insights.
Sometimes we may want to replace missing, incorrect, or placeholder values. For example, we might replace nulls with default values or incorrect codes with readable labels.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("ReplaceExample").getOrCreate()
data = [
    ("Alice", 34, "New York"),
    ("Bob", None, "Los Angeles"),
    ("Charlie", 29, None)
]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
# Replace nulls with default values
df_filled = df.fillna({"Age": 0, "City": "Unknown"})
df_filled.show()
+-------+---+-----------+
|   Name|Age|       City|
+-------+---+-----------+
|  Alice| 34|   New York|
|    Bob|  0|Los Angeles|
|Charlie| 29|    Unknown|
+-------+---+-----------+
Why do we replace nulls with values like 0 or "Unknown"?
Because leaving nulls can cause issues in aggregations or analytics. By replacing them, we maintain consistency and avoid errors in downstream processing.
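fillna() covers nulls, but as mentioned in the introduction, we sometimes also want to swap incorrect or placeholder codes for readable labels. Here is a minimal sketch using replace(); the "N/A" placeholder and the extra rows are made up purely for illustration:
# Hypothetical data where City holds placeholder codes
data_codes = [("Dave", 41, "N/A"), ("Eve", 25, "NY")]
df_codes = spark.createDataFrame(data_codes, ["Name", "Age", "City"])
# replace() swaps exact matching values in the chosen column
df_labeled = df_codes.replace({"N/A": "Unknown", "NY": "New York"}, subset=["City"])
df_labeled.show()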
Filtering means selecting only the rows that match certain conditions. This helps narrow down the data for analysis.
# Filter people older than 30
filtered_df = df.filter(col("Age") > 30)
filtered_df.show()
+-----+---+--------+
| Name|Age|    City|
+-----+---+--------+
|Alice| 34|New York|
+-----+---+--------+
What happens if the column contains nulls?
In Spark, comparing null to a number evaluates to null rather than true or false, and filter() keeps only rows where the condition is true, so rows with a null Age are silently dropped unless you handle them explicitly.
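If rows with a null Age should not simply disappear, you have to account for them yourself. Two common options, sketched against the same DataFrame as above:
# Option 1: keep rows with a null Age alongside the matching rows
older_or_unknown = df.filter((col("Age") > 30) | col("Age").isNull())
older_or_unknown.show()
# Option 2: fill nulls first, then filter on the replaced value
older_after_fill = df.fillna({"Age": 0}).filter(col("Age") > 30)
older_after_fill.show()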
Grouping is used to summarize or aggregate data based on one or more columns. For instance, you might want to calculate the average salary by department or the count of users by city.
from pyspark.sql.functions import avg
# Replace nulls before grouping
df_clean = df.fillna({"Age": 0, "City": "Unknown"})
# Group by City and compute average age
grouped_df = df_clean.groupBy("City").agg(avg("Age").alias("Average_Age"))
grouped_df.show()
+-----------+-----------+
|       City|Average_Age|
+-----------+-----------+
|   New York|       34.0|
|Los Angeles|        0.0|
|    Unknown|       29.0|
+-----------+-----------+
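The same groupBy pattern works for other aggregations, such as the count of users by city mentioned earlier. A quick sketch with count():
from pyspark.sql.functions import count
# Count how many people fall into each city
city_counts = df_clean.groupBy("City").agg(count("*").alias("User_Count"))
city_counts.show()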
Why do we often replace nulls before grouping?
If we don’t, Spark puts rows with a null grouping key into a separate null group, and aggregate functions such as avg() simply ignore null values, which can distort the results or make analysis harder.
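You can see this behaviour by grouping the original DataFrame without filling nulls first:
# Grouping the raw DataFrame: the null City forms its own group
df.groupBy("City").agg(avg("Age").alias("Average_Age")).show()
# Charlie's null City appears as a separate (null) group, and Bob's
# null Age is ignored by avg(), leaving a null average for Los Angeles.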
In real projects, we often use these operations together to clean, subset, and analyze data.
# Step 1: Clean nulls
df_clean = df.fillna({"Age": 0, "City": "Unknown"})
# Step 2: Filter Age > 20
df_filtered = df_clean.filter(col("Age") > 20)
# Step 3: Group by City and calculate average
result = df_filtered.groupBy("City").agg(avg("Age").alias("Avg_Age"))
result.show()
+--------+-------+
|    City|Avg_Age|
+--------+-------+
|New York|   34.0|
| Unknown|   29.0|
+--------+-------+
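Because each operation returns a new DataFrame, the same pipeline can also be written as one chained expression; whether you keep intermediate variables or chain the calls is mostly a readability choice:
# Same pipeline, chained in a single expression
result = (
    df.fillna({"Age": 0, "City": "Unknown"})
      .filter(col("Age") > 20)
      .groupBy("City")
      .agg(avg("Age").alias("Avg_Age"))
)
result.show()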
Replacing, filtering, and grouping are foundational operations in data transformation. They allow us to clean messy datasets, extract meaningful insights, and prepare data for deeper analysis or machine learning. Mastering these in PySpark will help you tackle large datasets with confidence.