Machine Learning for Beginners

Handling Missing Values and Outliers in Machine Learning




Before we feed any dataset to a machine learning model, we must clean and prepare the data. Two of the most common issues you’ll face during data cleaning are:

  1. Missing Values – when some data entries are blank or not available
  2. Outliers – data points that are significantly different from the rest of the data

Part 1: Handling Missing Values

Missing values can cause issues with ML models since most models do not accept data with missing entries. Let's look at why missing values occur and how to deal with them.

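Why do missing values occur?

Values go missing for many reasons: data entry errors or skipped form fields, sensor or system failures during collection, information that does not apply to a record, and gaps introduced when datasets from different sources are merged.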

Example Dataset

Suppose we are building a model to predict house prices and the dataset looks like this:

House ID  Location     Size (sqft)  Bedrooms  Price (in $)
1         New York     1000         2         500000
2         Los Angeles  NaN          3         650000
3         NaN          1500         NaN       700000
4         Chicago      1200         2         NaN

We have missing values in Location, Size, Bedrooms, and Price.
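
Rather than spotting the gaps by eye, you can count them. The sketch below rebuilds the example table in pandas and counts missing entries per column; the variable names (houses, missing_per_column, columns_with_missing) are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Same four-row example as the table above (names are illustrative)
houses = pd.DataFrame({
    'House ID': [1, 2, 3, 4],
    'Location': ['New York', 'Los Angeles', np.nan, 'Chicago'],
    'Size (sqft)': [1000, np.nan, 1500, 1200],
    'Bedrooms': [2, 3, np.nan, 2],
    'Price': [500000, 650000, 700000, np.nan],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = houses.isna().sum()
print(missing_per_column)

# Names of the columns that contain at least one missing value
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```

Each of the four affected columns has exactly one missing entry, while House ID is complete.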

Common Strategies to Handle Missing Values

  1. Drop rows or columns
    • Drop rows where important values are missing
    • Drop entire columns if they are mostly empty
  2. Imputation
    • Numerical: Replace with mean, median, or mode
    • Categorical: Replace with most frequent value
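
The dropping options above can be sketched with pandas' dropna. This is a minimal illustration on the example table, not a recommended recipe: the 50% cutoff for a "mostly empty" column and all variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    'House ID': [1, 2, 3, 4],
    'Location': ['New York', 'Los Angeles', np.nan, 'Chicago'],
    'Size (sqft)': [1000, np.nan, 1500, 1200],
    'Bedrooms': [2, 3, np.nan, 2],
    'Price': [500000, 650000, 700000, np.nan],
})

# Drop rows where an important value (here the target, Price) is missing
rows_with_price = houses.dropna(subset=['Price'])

# Drop columns in which more than half of the values are missing.
# The 0.5 cutoff is an assumption for illustration.
max_missing_share = 0.5
missing_share = houses.isna().mean()
mostly_complete = houses.loc[:, missing_share <= max_missing_share]

# Drop every row that has any missing value at all
complete_rows = houses.dropna()

print(len(rows_with_price), mostly_complete.shape[1], len(complete_rows))
```

On this tiny table, dropping by Price keeps three rows and no column is empty enough to be removed, but dropping every incomplete row leaves only House 1.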

🧠 Question:

Why not always drop rows with missing values?

Answer:

Dropping too many rows shrinks the dataset and can introduce bias if the rows with missing values differ systematically from the rest. Dropping is reasonable only when very few rows are affected or when a row carries little useful information.

Python Code – Handling Missing Values

import pandas as pd
import numpy as np

# Sample data
data = {
    'House ID': [1, 2, 3, 4],
    'Location': ['New York', 'Los Angeles', np.nan, 'Chicago'],
    'Size (sqft)': [1000, np.nan, 1500, 1200],
    'Bedrooms': [2, 3, np.nan, 2],
    'Price': [500000, 650000, 700000, np.nan]
}

df = pd.DataFrame(data)

# Show original dataset
print("Original Data:")
print(df)

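# Fill missing numerical values:
# mean for Size, mode (most frequent value) for Bedrooms, median for Price.
# Assigning the result back avoids in-place fills on a column selection,
# which newer pandas versions warn about.
df['Size (sqft)'] = df['Size (sqft)'].fillna(df['Size (sqft)'].mean())
df['Bedrooms'] = df['Bedrooms'].fillna(df['Bedrooms'].mode()[0])
df['Price'] = df['Price'].fillna(df['Price'].median())

# Fill missing categorical value with mode
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])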

# Final cleaned data
print("\nCleaned Data:")
print(df)

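Code Explanation:

  • Size (sqft): the missing entry is replaced with the column mean.
  • Bedrooms: the missing entry is replaced with the mode, which keeps the value a whole number.
  • Price: the missing entry is replaced with the median.
  • Location: the missing entry is replaced with the most frequent value. Here every city appears once, so mode()[0] simply picks the first of the tied values.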


Part 2: Handling Outliers

Outliers are data points that are significantly different from the rest. For example, in a dataset of house prices ranging from 300K to 800K, a price of 5 million is an outlier.

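How do outliers affect ML models?

Outliers pull the mean and standard deviation away from typical values. Models that minimize squared error, such as linear regression, can be skewed by a few extreme points, while tree-based models are generally less sensitive.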

How to detect outliers?

  1. Boxplot – visually shows data distribution
  2. Z-score method – values more than 3 standard deviations from the mean
  3. IQR method – values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
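
As a rough sketch of the Z-score method listed above, the snippet below flags values more than 3 standard deviations from the mean. The price list is hypothetical, built to match the earlier example of prices between 300K and 800K plus one 5 million entry.

```python
import pandas as pd

# Hypothetical prices: 19 values between 300K and 800K plus one 5 million entry
prices = pd.Series(list(range(300_000, 800_000, 27_000)) + [5_000_000])

# Z-score: distance from the mean, measured in standard deviations
z_scores = (prices - prices.mean()) / prices.std()

# Flag values more than 3 standard deviations from the mean
threshold = 3
outliers = prices[z_scores.abs() > threshold]
print(outliers)
```

Only the 5 million entry is flagged. The method needs a reasonably large sample: with a four-row table like the house example, no value could reach a z-score of 3.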

Python Code – Detect and Remove Outliers using IQR

import matplotlib.pyplot as plt

# Boxplot to visualize outliers (df is the cleaned DataFrame from Part 1)
plt.boxplot(df['Price'])
plt.title("Boxplot for Price")
plt.show()

# Detecting outliers using IQR
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1

# Filtering data
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df_no_outliers = df[(df['Price'] >= lower_bound) & (df['Price'] <= upper_bound)]

print("Data without outliers:")
print(df_no_outliers)

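Code Explanation:

  • plt.boxplot draws the distribution of Price; points beyond the whiskers are potential outliers.
  • Q1 and Q3 are the 25th and 75th percentiles, and IQR is the distance between them.
  • Rows whose Price falls outside Q1 - 1.5*IQR to Q3 + 1.5*IQR are filtered out.
  • On a dataset as small as the four-row example, the quartiles are rough estimates, so the rule can flag ordinary values: with the cleaned data from Part 1, the lower bound works out to 537500 and the 500000 house is removed.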

🧠 Question:

Why use median instead of mean when handling outliers?

Answer:

The median is less affected by outliers and better represents the center of skewed data, making it more reliable for imputation or central tendency in such cases.
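
A small numeric check makes this concrete. The prices below are hypothetical, chosen to match the earlier 300K to 800K example with a single 5 million entry added.

```python
import pandas as pd

# Hypothetical prices without and with one extreme value
typical = pd.Series([300_000, 450_000, 500_000, 650_000, 800_000])
with_outlier = pd.concat([typical, pd.Series([5_000_000])], ignore_index=True)

# Compare how far each statistic moves when the outlier is added
print("Mean:  ", typical.mean(), "->", with_outlier.mean())
print("Median:", typical.median(), "->", with_outlier.median())
```

The mean more than doubles, from 540000 to roughly 1.28 million, while the median only moves from 500000 to 575000.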


Summary

Handling data quality issues is one of the most important skills in a machine learning project, and it comes before algorithm selection. The better the data, the better the model!


