Dropping Nulls and Handling Missing Values in Apache Spark
In real-world datasets, missing values (nulls) are very common. These can result from data corruption, incomplete data entry, or sensor failures. Handling them properly is a critical part of the data cleaning process before any analysis or machine learning can be done.
What is a Null Value?
A null value represents missing or unknown data. In Apache Spark, a null means that the value is not present in a cell. For example, if a customer didn’t fill in their phone number, that field might be null in the dataset.
Why Are Missing Values a Problem?
- They can break downstream calculations (e.g., sum, average)
- They reduce model accuracy if not treated properly
- They can create bias in analysis if not handled carefully
Question:
Why not just ignore missing values?
Answer:
Because ignoring them may lead to inaccurate analysis or errors in transformation and machine learning steps. It's important to make an informed choice about whether to drop or fill them, depending on the context.
Creating a Sample Dataset with Nulls
Let's simulate a dataset with nulls using PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName("NullHandling").getOrCreate()
data = [
    ("Alice", 34, "New York"),
    ("Bob", None, "Los Angeles"),
    (None, 45, "Chicago"),
    ("Daisy", 29, None),
    ("Evan", None, None)
]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])
df = spark.createDataFrame(data, schema)
df.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
|  Bob|null|Los Angeles|
| null|  45|    Chicago|
|Daisy|  29|       null|
| Evan|null|       null|
+-----+----+-----------+
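Before deciding whether to drop or fill, it helps to know how many nulls each column actually contains. Here is a minimal sketch using the df defined above (the variable name null_counts is illustrative):
from pyspark.sql.functions import col, count, when

# when(...) yields null unless the cell is null, and count() ignores nulls,
# so this counts only the missing cells in each column.
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()
For the sample data above, this reports 1 missing Name, 2 missing Ages, and 2 missing Cities.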
How to Drop Rows with Null Values
If you want to remove all rows that contain any null values:
df_drop_any = df.na.drop()
df_drop_any.show()
+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Alice| 34| New York|
+-----+---+---------+
Only rows with no nulls are kept.
Question:
What if we only want to drop rows where all fields are null?
Answer:
Use df.na.drop("all") to remove only rows where every column is null.
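For example, a minimal sketch on the same df (the variable name df_drop_all is illustrative):
# how="all" drops a row only when every column in that row is null.
df_drop_all = df.na.drop(how="all")
df_drop_all.show()
In this sample no row is null in every column, so all five rows are kept.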
Drop Rows Based on Specific Columns
If we only care about nulls in the Age column:
df_drop_age_nulls = df.na.drop(subset=["Age"])
df_drop_age_nulls.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
| null|  45|    Chicago|
|Daisy|  29|       null|
+-----+----+-----------+
Only rows where Age is not null are kept.
Filling Missing Values
Sometimes, instead of dropping rows, we can fill in missing values using a default or computed value.
Filling with Static Values
df_filled = df.na.fill({
    "Name": "Unknown",
    "Age": 0,
    "City": "Unknown"
})
df_filled.show()
+-------+---+-----------+
|   Name|Age|       City|
+-------+---+-----------+
|  Alice| 34|   New York|
|    Bob|  0|Los Angeles|
|Unknown| 45|    Chicago|
|  Daisy| 29|    Unknown|
|   Evan|  0|    Unknown|
+-------+---+-----------+
Question:
When should we prefer filling over dropping?
Answer:
If the number of missing values is small or if they can be reasonably inferred (e.g., Age = 0 as a placeholder), filling avoids losing useful data.
Filling with Mean or Median
Spark doesn’t directly support filling with the mean in na.fill(), but we can compute it first and pass it in:
from pyspark.sql.functions import mean
mean_age = df.select(mean("Age")).collect()[0][0]
df_filled_mean = df.na.fill({"Age": int(mean_age)})
df_filled_mean.show()
+-----+---+-----------+
| Name|Age|       City|
+-----+---+-----------+
|Alice| 34|   New York|
|  Bob| 36|Los Angeles|
| null| 45|    Chicago|
|Daisy| 29|       null|
| Evan| 36|       null|
+-----+---+-----------+
This fills the missing ages with the mean of the non-null ages (36 in this example).
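To fill with a median instead, one option is an approximate quantile via DataFrame.approxQuantile, which skips nulls. A minimal sketch (the variable names median_age and df_filled_median are illustrative):
# The 0.5 quantile is the median; the last argument is the allowed relative error.
median_age = df.approxQuantile("Age", [0.5], 0.01)[0]
df_filled_median = df.na.fill({"Age": int(median_age)})
df_filled_median.show()
For this small sample the median of 29, 34, and 45 is 34, so the missing ages would be filled with 34.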
Summary
- Use df.na.drop() to remove rows with nulls.
- Use df.na.fill() to replace nulls with default or computed values.
- Always analyze the nature of your data before choosing to drop or fill nulls.
Handling missing values is a foundational skill in data preprocessing. It ensures cleaner datasets and improves the reliability of downstream analytics and modeling tasks.