Apache Spark Course
Module 12: Project – Real-World Data Pipeline

Dropping Nulls and Handling Missing Values in Apache Spark

In real-world datasets, missing values (nulls) are very common. These can result from data corruption, incomplete data entry, or sensor failures. Handling them properly is a critical part of the data cleaning process before any analysis or machine learning can be done.

What is a Null Value?

A null value represents missing or unknown data. In Apache Spark, a null means that the value is not present in a cell. For example, if a customer didn’t fill in their phone number, that field might be null in the dataset.

Why Are Missing Values a Problem?

  • They can distort downstream calculations: arithmetic on a null yields null, and aggregates such as sum and average silently skip it (demonstrated below)
  • They reduce model accuracy if not treated properly
  • They can create bias in analysis if not handled carefully

Question:

Why not just ignore missing values?

Answer:

Because ignoring them may lead to inaccurate analysis or errors in transformation and machine learning steps. It’s important to make an informed choice — either drop or fill them depending on the context.

Creating a Sample Dataset with Nulls

Let's simulate a dataset with nulls using PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("NullHandling").getOrCreate()

data = [
    ("Alice", 34, "New York"),
    ("Bob", None, "Los Angeles"),
    (None, 45, "Chicago"),
    ("Daisy", 29, None),
    ("Evan", None, None)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
|  Bob|null|Los Angeles|
| null|  45|    Chicago|
|Daisy|  29|       null|
| Evan|null|       null|
+-----+----+-----------+
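
The first bullet from earlier is easy to demonstrate on this DataFrame. A quick sketch (the column name AgeNextYear is just for illustration): arithmetic involving a null produces null, and aggregates quietly skip nulls.

from pyspark.sql.functions import avg, col

# Arithmetic on a null yields null: Bob and Evan get a null AgeNextYear
df.withColumn("AgeNextYear", col("Age") + 1).show()

# avg() silently skips nulls, so this average covers only 3 of the 5 rows
df.select(avg("Age")).show()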

How to Drop Rows with Null Values

If you want to remove all rows that contain any null values:

df_drop_any = df.na.drop()
df_drop_any.show()
+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Alice| 34| New York|
+-----+---+---------+

Only rows with no nulls are kept.

Question:

What if we only want to drop rows where all fields are null?

Answer:

Use df.na.drop("all") to remove only rows where every column is null.
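
For example, none of the rows in the sample DataFrame is entirely null, so this variant keeps all five rows (df_drop_all is just an illustrative name):

# Drop a row only when every one of its columns is null
df_drop_all = df.na.drop(how="all")
df_drop_all.show()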

Drop Rows Based on Specific Columns

If we only care about nulls in the Age column:

df_drop_age_nulls = df.na.drop(subset=["Age"])
df_drop_age_nulls.show()
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
| null|  45|    Chicago|
|Daisy|  29|       null|
+-----+----+-----------+

Only rows where Age is not null are kept.
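
There is also a middle ground between "any" and "all": the thresh parameter keeps only rows that have at least a given number of non-null values. A small sketch:

# Keep rows with at least 2 non-null values;
# only Evan's row (1 non-null value) is dropped
df_drop_thresh = df.na.drop(thresh=2)
df_drop_thresh.show()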

Filling Missing Values

Sometimes, instead of dropping rows, we can fill in missing values using a default or computed value.

Filling with Static Values

df_filled = df.na.fill({
    "Name": "Unknown",
    "Age": 0,
    "City": "Unknown"
})
df_filled.show()
+-------+---+-----------+
|   Name|Age|       City|
+-------+---+-----------+
|  Alice| 34|   New York|
|    Bob|  0|Los Angeles|
|Unknown| 45|    Chicago|
|  Daisy| 29|    Unknown|
|   Evan|  0|    Unknown|
+-------+---+-----------+
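
Note that na.fill() also accepts a single value instead of a dictionary, in which case it fills only the columns whose type matches that value. A brief sketch:

# A numeric value fills only the numeric columns (here, Age)
df.na.fill(0).show()

# A string fills only the string columns (here, Name and City)
df.na.fill("Unknown").show()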

Question:

When should we prefer filling over dropping?

Answer:

If the number of missing values is small or if they can be reasonably inferred (e.g., Age = 0 as a placeholder), filling avoids losing useful data.
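
To make that choice, it helps to first measure how much data is actually missing. A minimal sketch that counts the nulls in each column (this list-comprehension pattern is a common idiom, not a dedicated API):

from pyspark.sql.functions import col, count, when

# when() returns null where the condition is false, and count() skips
# nulls, so this counts only the null cells per column.
# For the sample data: Name = 1, Age = 2, City = 2
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()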

Filling with Mean or Median

Spark doesn’t directly support filling with mean in na.fill(), but we can do it like this:

from pyspark.sql.functions import mean

# mean() ignores nulls, so this is the average of the three known ages
mean_age = df.select(mean("Age")).collect()[0][0]

# Cast to int so the value matches the IntegerType Age column
df_filled_mean = df.na.fill({"Age": int(mean_age)})
df_filled_mean.show()
+-----+---+-----------+
| Name|Age|       City|
+-----+---+-----------+
|Alice| 34|   New York|
|  Bob| 36|Los Angeles|
| null| 45|    Chicago|
|Daisy| 29|       null|
| Evan| 36|       null|
+-----+---+-----------+

This fills the missing ages with the average of the non-null ages (108 / 3 = 36).
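
The heading also mentions the median. na.fill() can't compute that either, but approxQuantile() can. A sketch, where median_age is an illustrative name and the last argument is the allowed relative error:

# 0.5 requests the median; a relative error of 0.01 trades a little
# precision for speed (0.0 would be exact but more expensive).
# Nulls are ignored, so the median of 29, 34, 45 is 34.
median_age = df.approxQuantile("Age", [0.5], 0.01)[0]
df_filled_median = df.na.fill({"Age": int(median_age)})
df_filled_median.show()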

Summary

  • Use df.na.drop() to remove rows with nulls.
  • Use df.na.fill() to replace nulls with default or computed values.
  • Always analyze the nature of your data before choosing to drop or fill nulls.

Handling missing values is a foundational skill in data preprocessing. It ensures cleaner datasets and improves the reliability of downstream analytics and modeling tasks.