Dropping Nulls and Handling Missing Values in Apache Spark
In real-world datasets, missing values (nulls) are very common. These can result from data corruption, incomplete data entry, or sensor failures. Handling them properly is a critical part of the data cleaning process before any analysis or machine learning can be done.
What is a Null Value?
A null value represents missing or unknown data. In Apache Spark, a null means that the value is not present in a cell. For example, if a customer didn’t fill in their phone number, that field might be null in the dataset.
Why Are Missing Values a Problem?
- They can break downstream calculations (e.g., sum, average)
- They reduce model accuracy if not treated properly
- They can create bias in analysis if not handled carefully
Question:
Why not just ignore missing values?
Answer:
Because ignoring them may lead to inaccurate analysis or errors in transformation and machine learning steps. It’s important to make an informed choice — either drop or fill them depending on the context.
Creating a Sample Dataset with Nulls
Let's simulate a dataset with nulls using PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName("NullHandling").getOrCreate()
data = [
    ("Alice", 34, "New York"),
    ("Bob", None, "Los Angeles"),
    (None, 45, "Chicago"),
    ("Daisy", 29, None),
    ("Evan", None, None)
]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])
df = spark.createDataFrame(data, schema)
df.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
|  Bob|null|Los Angeles|
| null|  45|    Chicago|
|Daisy|  29|       null|
| Evan|null|       null|
+-----+----+-----------+
How to Drop Rows with Null Values
If you want to remove all rows that contain any null values:
df_drop_any = df.na.drop()
df_drop_any.show()
+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Alice| 34| New York|
+-----+---+---------+
Only rows with no nulls are kept.
Question:
What if we only want to drop rows where all fields are null?
Answer:
Use df.na.drop("all") to remove only rows where every column is null, as sketched below.
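Here is a minimal sketch on the sample df from above (the variable name df_drop_all is just illustrative); since none of our five rows is entirely null, all of them would be kept:
# Drop a row only if every single column in it is null
df_drop_all = df.na.drop(how="all")
df_drop_all.show()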
Drop Rows Based on Specific Columns
If we only care about nulls in the Age column:
df_drop_age_nulls = df.na.drop(subset=["Age"])
df_drop_age_nulls.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
| null|  45|    Chicago|
|Daisy|  29|       null|
+-----+----+-----------+
Only rows where Age is not null are kept.
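A related option not shown above is the thresh argument of na.drop(), which keeps a row only if it has at least a given number of non-null values. A small sketch on the same df (variable name illustrative):
# Keep rows with at least 2 non-null values; here only Evan's row would be dropped
df_drop_thresh = df.na.drop(thresh=2)
df_drop_thresh.show()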
Filling Missing Values
Sometimes, instead of dropping rows, we can fill in missing values using a default or computed value.
Filling with Static Values
df_filled = df.na.fill({
    "Name": "Unknown",
    "Age": 0,
    "City": "Unknown"
})
df_filled.show()
+-------+---+-----------+
|   Name|Age|       City|
+-------+---+-----------+
|  Alice| 34|   New York|
|    Bob|  0|Los Angeles|
|Unknown| 45|    Chicago|
|  Daisy| 29|    Unknown|
|   Evan|  0|    Unknown|
+-------+---+-----------+
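na.fill() also accepts a subset argument if only certain columns should be filled. A brief sketch on the same df (df_age_filled is an illustrative name):
# Fill nulls only in the Age column; Name and City nulls stay as they are
df_age_filled = df.na.fill(0, subset=["Age"])
df_age_filled.show()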
Question:
When should we prefer filling over dropping?
Answer:
If the number of missing values is small or if they can be reasonably inferred (e.g., Age = 0 as a placeholder), filling avoids losing useful data.
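To make that call, it helps to first measure how much data is actually missing. One common sketch counts nulls per column of the sample df (null_counts is an illustrative name):
from pyspark.sql.functions import col, count, when
# count() ignores nulls, and the when(...) expression is null for non-missing
# values, so this yields the number of nulls in each column
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()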
Filling with Mean or Median
Spark’s na.fill() doesn’t directly support filling with the mean, but we can compute the mean first and then fill with it:
from pyspark.sql.functions import mean
mean_age = df.select(mean("Age")).collect()[0][0]
df_filled_mean = df.na.fill({"Age": int(mean_age)})
df_filled_mean.show()
+-----+---+-----------+
| Name|Age|       City|
+-----+---+-----------+
|Alice| 34|   New York|
|  Bob| 36|Los Angeles|
| null| 45|    Chicago|
|Daisy| 29|       null|
| Evan| 36|       null|
+-----+---+-----------+
This fills missing ages with the average age value.
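Filling with the median works the same way once the median has been computed; one way to get it is DataFrame.approxQuantile(), where 0.5 is the median quantile and the last argument is the allowed relative error (0.0 for an exact result). A sketch under the same assumptions, with illustrative variable names:
# Compute the median age, then use it as the fill value
median_age = df.approxQuantile("Age", [0.5], 0.0)[0]
df_filled_median = df.na.fill({"Age": int(median_age)})
df_filled_median.show()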
Summary
- Use df.na.drop() to remove rows with nulls.
- Use df.na.fill() to replace nulls with default or computed values.
- Always analyze the nature of your data before choosing to drop or fill nulls.
Handling missing values is a foundational skill in data preprocessing. It ensures cleaner datasets and improves the reliability of downstream analytics and modeling tasks.