Before we feed any dataset to a machine learning model, we must clean and prepare the data. Two of the most common issues you'll face during data cleaning are missing values and outliers.
Missing values can cause issues with ML models since most models do not accept data with missing entries. Let's look at why missing values occur and how to deal with them.
Suppose we are building a model to predict house prices and the dataset looks like this:
| House ID | Location | Size (sqft) | Bedrooms | Price (in $) |
|---|---|---|---|---|
| 1 | New York | 1000 | 2 | 500000 |
| 2 | Los Angeles | NaN | 3 | 650000 |
| 3 | NaN | 1500 | NaN | 700000 |
| 4 | Chicago | 1200 | 2 | NaN |
We have missing values in `Location`, `Size (sqft)`, `Bedrooms`, and `Price`.
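Before deciding how to handle missing values, it helps to count how many each column has. A quick sketch using the same sample data as the table above:

```python
import pandas as pd
import numpy as np

# Same sample data as the table above
df = pd.DataFrame({
    'House ID': [1, 2, 3, 4],
    'Location': ['New York', 'Los Angeles', np.nan, 'Chicago'],
    'Size (sqft)': [1000, np.nan, 1500, 1200],
    'Bedrooms': [2, 3, np.nan, 2],
    'Price': [500000, 650000, 700000, np.nan],
})

# Count missing values per column
print(df.isnull().sum())
```

`isnull()` marks each cell as True/False for missingness, and `sum()` totals the True values per column, so you can see at a glance which columns need attention.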
Why not always drop rows with missing values?
Dropping too many rows leads to data loss and can bias the model. It is a safe choice only when very few rows have missing values, or when the affected rows carry little useful information.
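To see how costly dropping can be, compare the row count before and after `dropna()` on the sample dataset. In this toy table, three of the four rows contain at least one NaN:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'House ID': [1, 2, 3, 4],
    'Location': ['New York', 'Los Angeles', np.nan, 'Chicago'],
    'Size (sqft)': [1000, np.nan, 1500, 1200],
    'Bedrooms': [2, 3, np.nan, 2],
    'Price': [500000, 650000, 700000, np.nan],
})

# Drop every row that has at least one missing value
df_dropped = df.dropna()

print(f"Rows before: {len(df)}, rows after: {len(df_dropped)}")
# Only House 1 survives: dropping discards 75% of this dataset
```

That is why imputation (filling missing values) is usually preferred when missingness is spread across many rows.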
import pandas as pd
import numpy as np
# Sample data
data = {
'House ID': [1, 2, 3, 4],
'Location': ['New York', 'Los Angeles', np.nan, 'Chicago'],
'Size (sqft)': [1000, np.nan, 1500, 1200],
'Bedrooms': [2, 3, np.nan, 2],
'Price': [500000, 650000, 700000, np.nan]
}
df = pd.DataFrame(data)
# Show original dataset
print("Original Data:")
print(df)
# Fill missing numerical values: mean for size, mode for bedrooms,
# median for price (the median is robust to outliers)
df['Size (sqft)'] = df['Size (sqft)'].fillna(df['Size (sqft)'].mean())
df['Bedrooms'] = df['Bedrooms'].fillna(df['Bedrooms'].mode()[0])
df['Price'] = df['Price'].fillna(df['Price'].median())
# Fill missing categorical value with the mode (most frequent value)
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])
# Final cleaned data
print("\nCleaned Data:")
print(df)
`fillna()` is used to replace missing values:

- `Size (sqft)` is filled with the mean.
- `Bedrooms` is filled with the mode (most common value).
- `Price` is filled with the median to reduce the impact of outliers.
- `Location` (categorical) is filled with the mode.

Outliers are data points that are significantly different from the rest. For example, in a dataset of house prices ranging from 300K to 800K, a price of 5 million is an outlier.
import matplotlib.pyplot as plt
# Boxplot to visualize outliers
plt.boxplot(df['Price'])
plt.title("Boxplot for Price")
plt.show()
# Detecting outliers using IQR
Q1 = df['Price'].quantile(0.25)
Q3 = df['Price'].quantile(0.75)
IQR = Q3 - Q1
# Filtering data
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_no_outliers = df[(df['Price'] >= lower_bound) & (df['Price'] <= upper_bound)]
print("Data without outliers:")
print(df_no_outliers)
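The IQR rule above is one common convention. Another widely used approach, not part of this tutorial's original code, is the z-score: flag points that sit more than a chosen number of standard deviations from the mean. A minimal sketch, using an illustrative price array and a threshold of 2 (the threshold is a judgment call):

```python
import numpy as np

# Illustrative prices: typical range plus one extreme value
prices = np.array([300_000, 450_000, 500_000, 600_000, 800_000, 5_000_000])

# z-score: how many standard deviations each point sits from the mean
z = (prices - prices.mean()) / prices.std()

# Flag points beyond 2 standard deviations as outliers
outliers = prices[np.abs(z) > 2]
print(outliers)
```

Note that the z-score itself uses the mean and standard deviation, which the outlier distorts, so for heavily skewed data the IQR method is often the more robust choice.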
- `quantile()` gets the 25th (Q1) and 75th (Q3) percentiles.
- `IQR` is the interquartile range: Q3 - Q1.

Why use the median instead of the mean when handling outliers?
The median is less affected by outliers and better represents the center of skewed data, making it more reliable for imputation or central tendency in such cases.
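A quick numeric sketch makes this concrete: adding one extreme price to the 300K-800K range from the example above shifts the mean far more than the median.

```python
import numpy as np

# Typical prices plus one extreme outlier (5 million)
prices = np.array([300_000, 400_000, 500_000, 600_000, 800_000, 5_000_000])

print(f"Mean:   {np.mean(prices):,.0f}")    # pulled far upward by the outlier
print(f"Median: {np.median(prices):,.0f}")  # stays near the typical range
```

The mean lands well above every typical price, while the median remains in the 500K-600K range, which is why the code above imputes `Price` with the median.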
Handling data quality is one of the most important skills in a machine learning project; it matters even before algorithm selection. The better the data, the better the model!