Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Dropping Nulls and Handling Missing Values in Apache Spark

In real-world datasets, missing values (nulls) are very common. These can result from data corruption, incomplete data entry, or sensor failures. Handling them properly is a critical part of the data cleaning process before any analysis or machine learning can be done.

What is a Null Value?

A null value represents missing or unknown data. In Apache Spark, a null means that the value is not present in a cell. For example, if a customer didn’t fill in their phone number, that field might be null in the dataset.

Why Are Missing Values a Problem?

Question:

Why not just ignore missing values?

Answer:

Because ignoring them may lead to inaccurate analysis or errors in transformation and machine learning steps. It’s important to make an informed choice — either drop or fill them depending on the context.

Creating a Sample Dataset with Nulls

Let's simulate a dataset with nulls using PySpark:


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("NullHandling").getOrCreate()

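# Sample rows: None marks a missing value in that column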
data = [
    ("Alice", 34, "New York"),
    ("Bob", None, "Los Angeles"),
    (None, 45, "Chicago"),
    ("Daisy", 29, None),
    ("Evan", None, None)
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()
    
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
|  Bob|null|Los Angeles|
| null|  45|    Chicago|
|Daisy|  29|       null|
| Evan|null|       null|
+-----+----+-----------+
    
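Before deciding whether to drop or fill, it helps to know how many nulls each column actually contains. A minimal sketch using a conditional count per column (column names as in the sample above):


from pyspark.sql.functions import col, count, when

# Count the null values in every column in one pass
null_counts = df.select([
    count(when(col(c).isNull(), c)).alias(c) for c in df.columns
])
null_counts.show()

+----+---+----+
|Name|Age|City|
+----+---+----+
|   1|  2|   2|
+----+---+----+
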

How to Drop Rows with Null Values

If you want to remove all rows that contain any null values:


df_drop_any = df.na.drop()
df_drop_any.show()
    
+-----+---+---------+
| Name|Age|     City|
+-----+---+---------+
|Alice| 34| New York|
+-----+---+---------+
    

Only rows with no nulls are kept.

Question:

What if we only want to drop rows where all fields are null?

Answer:

Use df.na.drop("all") to remove only rows where every column is null.
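A minimal sketch of that variant (no row in this sample is entirely null, so all five rows survive):


# Drop a row only if every column in it is null
df_drop_all = df.na.drop("all")
df_drop_all.show()  # identical to the original df for this sample
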

Drop Rows Based on Specific Columns

If we only care about nulls in the Age column:


df_drop_age_nulls = df.na.drop(subset=["Age"])
df_drop_age_nulls.show()
    
+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
| null|  45|    Chicago|
|Daisy|  29|       null|
+-----+----+-----------+
    

Only rows where Age is not null are kept.
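
Besides "any" and "all", na.drop() also accepts a thresh argument: a row is kept only if it has at least that many non-null values. A minimal sketch (Evan's row has just one non-null field, so it is the only one dropped):


# Keep rows that have at least 2 non-null values
df_thresh = df.na.drop(thresh=2)
df_thresh.show()

+-----+----+-----------+
| Name| Age|       City|
+-----+----+-----------+
|Alice|  34|   New York|
|  Bob|null|Los Angeles|
| null|  45|    Chicago|
|Daisy|  29|       null|
+-----+----+-----------+
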

Filling Missing Values

Sometimes, instead of dropping rows, we can fill in missing values using a default or computed value.

Filling with Static Values


df_filled = df.na.fill({
    "Name": "Unknown",
    "Age": 0,
    "City": "Unknown"
})
df_filled.show()
    
+-------+---+-----------+
|   Name|Age|       City|
+-------+---+-----------+
|  Alice| 34|   New York|
|    Bob|  0|Los Angeles|
|Unknown| 45|    Chicago|
|  Daisy| 29|    Unknown|
|   Evan|  0|    Unknown|
+-------+---+-----------+
    
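Note that na.fill() also accepts a single bare value, which is then applied only to columns of a matching type. A quick sketch of that behaviour:


# A bare value fills only columns of the matching type
df_fill_numeric = df.na.fill(0)          # fills Age; Name and City keep their nulls
df_fill_strings = df.na.fill("Unknown")  # fills Name and City; Age keeps its nulls
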

Question:

When should we prefer filling over dropping?

Answer:

If the number of missing values is small or if they can be reasonably inferred (e.g., Age = 0 as a placeholder), filling avoids losing useful data.

Filling with Mean or Median

Spark doesn’t directly support filling with the mean in na.fill(), but we can compute the mean first and pass it in as a literal:


from pyspark.sql.functions import mean

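# mean() ignores nulls, so this is the average of the non-null ages (36.0 here)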
mean_age = df.select(mean("Age")).collect()[0][0]
df_filled_mean = df.na.fill({"Age": int(mean_age)})
df_filled_mean.show()
    
+-----+---+-----------+
| Name|Age|       City|
+-----+---+-----------+
|Alice| 34|   New York|
|  Bob| 36|Los Angeles|
| null| 45|    Chicago|
|Daisy| 29|       null|
| Evan| 36|       null|
+-----+---+-----------+
    

This fills the missing ages with the average of the non-null values (36 here), cast to int to match the column’s integer type.
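
Filling with the Median

The median is more robust to outliers than the mean. One way to obtain it is DataFrame.approxQuantile (the 0.5 quantile); a minimal sketch, assuming an approximate answer with a 1% relative error is acceptable:


# Approximate median of the non-null Age values (34 for this sample)
median_age = df.approxQuantile("Age", [0.5], 0.01)[0]
df_filled_median = df.na.fill({"Age": int(median_age)})
df_filled_median.show()

For larger pipelines, pyspark.ml.feature.Imputer provides "mean" and "median" strategies as a reusable transformer.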

Summary

Handling missing values is a foundational skill in data preprocessing. It ensures cleaner datasets and improves the reliability of downstream analytics and modeling tasks.


