Dropping Nulls and Handling Missing Values in Apache Spark
In real-world datasets, missing values (nulls) are very common. These can result from data corruption, incomplete data entry, or sensor failures. Handling them properly is a critical part of the data cleaning process before any analysis or machine learning can be done.
What is a Null Value?
A null value represents missing or unknown data. In Apache Spark, a null means that the value is not present in a cell. For example, if a customer didn’t fill in their phone number, that field might be null in the dataset.
Why Are Missing Values a Problem?
- They can break downstream calculations (e.g., sum, average)
- They reduce model accuracy if not treated properly
- They can create bias in analysis if not handled carefully
Question:
Why not just ignore missing values?
Answer:
Because ignoring them may lead to inaccurate analysis or errors in transformation and machine learning steps. It's important to make an informed choice about whether to drop or fill them, depending on the context.
Creating a Sample Dataset with Nulls
Let's simulate a dataset with nulls using PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName("NullHandling").getOrCreate()
data = [
    ("Alice", 34, "New York"),
    ("Bob", None, "Los Angeles"),
    (None, 45, "Chicago"),
    ("Daisy", 29, None),
    ("Evan", None, None)
]
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])
df = spark.createDataFrame(data, schema)
df.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
|  Bob|null|Los Angeles|
| null|  45|    Chicago|
|Daisy|  29|       null|
| Evan|null|       null|
+-----+----+-----------+
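Before deciding whether to drop or fill, it helps to know how many nulls each column actually contains. Here is a minimal sketch using the df defined above (the variable name null_counts is illustrative):
from pyspark.sql.functions import col, count, when

# when(...) yields null unless the cell is null, and count() ignores nulls,
# so this counts only the missing cells in each column.
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()
For the sample data above, this reports 1 missing Name, 2 missing Ages, and 2 missing Cities.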
How to Drop Rows with Null Values
If you want to remove all rows that contain any null values:
df_drop_any = df.na.drop()
df_drop_any.show()
+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Alice| 34| New York|
+-----+---+---------+
Only rows with no nulls are kept.
Question:
What if we only want to drop rows where all fields are null?
Answer:
Use df.na.drop("all") to remove only rows where every column is null.
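For example, a minimal sketch on the same df (the variable name df_drop_all is illustrative):
# how="all" drops a row only when every column in that row is null.
df_drop_all = df.na.drop(how="all")
df_drop_all.show()
In this sample no row is null in every column, so all five rows are kept.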
Drop Rows Based on Specific Columns
If we only care about nulls in the Age column:
df_drop_age_nulls = df.na.drop(subset=["Age"])
df_drop_age_nulls.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
| null|  45|    Chicago|
|Daisy|  29|       null|
+-----+----+-----------+
Only rows where Age is not null are kept.
Filling Missing Values
Sometimes, instead of dropping rows, we can fill in missing values using a default or computed value.
Filling with Static Values
df_filled = df.na.fill({
    "Name": "Unknown",
    "Age": 0,
    "City": "Unknown"
})
df_filled.show()
+-------+---+-----------+
|   Name|Age|       City|
+-------+---+-----------+
|  Alice| 34|   New York|
|    Bob|  0|Los Angeles|
|Unknown| 45|    Chicago|
|  Daisy| 29|    Unknown|
|   Evan|  0|    Unknown|
+-------+---+-----------+
Question:
When should we prefer filling over dropping?
Answer:
If the number of missing values is small or if they can be reasonably inferred (e.g., Age = 0 as a placeholder), filling avoids losing useful data.
Filling with Mean or Median
Spark doesn’t directly support filling with the mean in na.fill(), but we can compute it first and pass it in:
from pyspark.sql.functions import mean
mean_age = df.select(mean("Age")).collect()[0][0]
df_filled_mean = df.na.fill({"Age": int(mean_age)})
df_filled_mean.show()
+-----+---+-----------+
| Name|Age|       City|
+-----+---+-----------+
|Alice| 34|   New York|
|  Bob| 36|Los Angeles|
| null| 45|    Chicago|
|Daisy| 29|       null|
| Evan| 36|       null|
+-----+---+-----------+
This fills the missing ages with the mean of the non-null ages (36 in this example).
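To fill with a median instead, one option is an approximate quantile via DataFrame.approxQuantile, which skips nulls. A minimal sketch (the variable names median_age and df_filled_median are illustrative):
# The 0.5 quantile is the median; the last argument is the allowed relative error.
median_age = df.approxQuantile("Age", [0.5], 0.01)[0]
df_filled_median = df.na.fill({"Age": int(median_age)})
df_filled_median.show()
For this small sample the median of 29, 34, and 45 is 34, so the missing ages would be filled with 34.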
Summary
- Use df.na.drop() to remove rows with nulls.
- Use df.na.fill() to replace nulls with default or computed values.
- Always analyze the nature of your data before choosing to drop or fill nulls.
Handling missing values is a foundational skill in data preprocessing. It ensures cleaner datasets and improves the reliability of downstream analytics and modeling tasks.