Data Preprocessing with NumPy
Clean and Prepare Real-World Data
Introduction
Real-world data is rarely clean. Missing values, inconsistent formats, and outliers are common. Before any analysis or modeling, we must preprocess the data — and NumPy offers powerful tools to do this efficiently.
This tutorial will guide you step-by-step through preprocessing a dataset using NumPy. We’ll focus on practical tasks like loading data, handling missing values, converting data types, and verifying integrity.
Step 1: Load the Data into a NumPy Array
You can load external data using numpy.loadtxt() or numpy.genfromtxt(). The latter is especially useful when dealing with missing or incomplete values.
import numpy as np
# Load a CSV file that contains missing values
data = np.genfromtxt('sample-data.csv', delimiter=',', dtype='float32', skip_header=1, filling_values=np.nan)
print(data)
Explanation:
- delimiter=',' tells NumPy to expect comma-separated values.
- skip_header=1 ignores the first row (usually column headers).
- filling_values=np.nan ensures missing values are represented as np.nan (Not a Number).
Output: A 2D array where missing values are marked as nan.
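If you want to try the loader without a file on disk, genfromtxt also accepts file-like objects. Here is a minimal sketch with a few made-up values (the numbers are illustrative, not from sample-data.csv):
from io import StringIO
import numpy as np

# Small in-memory CSV with two missing fields, just for experimentation
csv_text = StringIO("a,b,c\n1.0,2.0,3.0\n4.0,,6.0\n,8.0,9.0\n")
sample = np.genfromtxt(csv_text, delimiter=',', dtype='float32',
                       skip_header=1, filling_values=np.nan)
print(sample)  # the two empty fields show up as nan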
Step 2: Identify Missing or Invalid Entries
Now that we’ve loaded the data, we need to find and quantify the missing values.
# Count missing values column-wise
missing_per_column = np.isnan(data).sum(axis=0)
print("Missing values per column:", missing_per_column)
This gives you a quick overview of where the data needs attention.
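It can also help to look at missing values as a fraction of each column, so you can decide whether a column is worth keeping at all. A short sketch (the 50% cut-off is just an illustrative threshold, not a rule):
# Fraction of missing values per column (mean of a boolean mask)
missing_fraction = np.isnan(data).mean(axis=0)
print("Fraction missing per column:", missing_fraction)

# Flag columns where more than half the values are missing
print("Heavily missing columns:", np.where(missing_fraction > 0.5)[0])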
Step 3: Handle Missing Data
There are multiple strategies to handle missing data:
Option 1: Remove Rows with Missing Data
cleaned_data = data[~np.isnan(data).any(axis=1)]
print(cleaned_data)
This line filters out any row that has even a single missing value. It’s simple — but risky if too much data is lost.
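Before committing to this option, it is worth measuring how much data the drop actually costs, for example:
# Compare row counts before and after dropping incomplete rows
rows_before = data.shape[0]
rows_after = cleaned_data.shape[0]
dropped = rows_before - rows_after
print(f"Dropped {dropped} of {rows_before} rows ({dropped / rows_before:.1%})")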
Option 2: Replace Missing Data with Column Means
# Replace NaN with column mean
col_mean = np.nanmean(data, axis=0)
indices = np.where(np.isnan(data))
data[indices] = np.take(col_mean, indices[1])
print(data)
Verification:
print("Any missing values left?", np.isnan(data).any())
Output: False, which confirms we’ve replaced all missing values.
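If your columns contain outliers, the mean itself can be skewed by them. A common alternative is to impute with the column median instead; the same pattern works with np.nanmedian (a sketch to run instead of, not after, the mean-based replacement above, while the NaNs are still present):
# Replace NaN with the column median, which is more robust to outliers
col_median = np.nanmedian(data, axis=0)
indices = np.where(np.isnan(data))
data[indices] = np.take(col_median, indices[1])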
Step 4: Normalize or Scale the Data
Scaling brings values into a uniform range, which is important for ML and statistical modeling.
# Min-Max normalization
data_min = data.min(axis=0)
data_max = data.max(axis=0)
normalized_data = (data - data_min) / (data_max - data_min)
print(normalized_data)
All values are now between 0 and 1. This removes bias caused by different units or scales across columns.
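One edge case to watch for: if a column is constant, data_max - data_min is zero and the division produces NaN (with a runtime warning). A small defensive sketch:
# Guard against constant columns, where max == min would divide by zero
col_range = data_max - data_min
col_range[col_range == 0] = 1.0  # constant columns map to all zeros instead of NaN
normalized_data = (data - data_min) / col_range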
Step 5: Data Type Conversion
Often, you’ll need to cast values to a consistent data type, especially when preparing inputs for a model.
converted_data = data.astype('float64')
print("New dtype:", converted_data.dtype)
Always verify the dtype after conversion. Inconsistent types can lead to silent bugs or runtime errors.
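As a rough illustration, widening the dtype doubles the memory footprint, while narrowing can silently discard information. A quick sketch of both effects:
# Widening float32 -> float64 doubles the memory used by the array
print("float32 bytes:", data.nbytes)
print("float64 bytes:", converted_data.nbytes)

# Narrowing is riskier: casting to an integer type truncates the decimals
truncated = converted_data.astype('int32')
print("Information lost in the cast?", not np.allclose(truncated, converted_data))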
Final Validation Checks
- Shape check: data.shape should match your expectations (rows x columns).
- NaN check: np.isnan(data).any() should be False.
- Range check: np.min(normalized_data), np.max(normalized_data) should fall within expected bounds after scaling.
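These checks can be bundled into a few assertions at the end of a preprocessing script, so they fail loudly if something slipped through. A sketch, where expected_shape is a placeholder you would set for your own file:
# Run the validation checks as assertions; expected_shape is a placeholder
expected_shape = (100, 4)  # set this to the dimensions you expect
assert data.shape == expected_shape, "Unexpected shape"
assert not np.isnan(data).any(), "NaNs remain after cleaning"
assert normalized_data.min() >= 0.0 and normalized_data.max() <= 1.0, "Scaled values outside [0, 1]"
print("All validation checks passed.")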
Conclusion
Data preprocessing is where the real work begins. Clean, consistent data is the foundation of reliable analysis and modeling. With NumPy’s fast and expressive syntax, you can wrangle messy datasets into structured form — ready for insights.
In the next lesson, we’ll explore how to work with structured CSV data, and even pair NumPy arrays with Pandas DataFrames.