Data Preprocessing with NumPy
Clean and Prepare Real-World Data
Introduction
Real-world data is rarely clean. Missing values, inconsistent formats, and outliers are common. Before any analysis or modeling, we must preprocess the data — and NumPy offers powerful tools to do this efficiently.
This tutorial will guide you step-by-step through preprocessing a dataset using NumPy. We’ll focus on practical tasks like loading data, handling missing values, converting data types, and verifying integrity.
Step 1: Load the Data into a NumPy Array
You can load external data using numpy.loadtxt() or numpy.genfromtxt(). The latter is especially useful when dealing with missing or incomplete values.
import numpy as np
# Load a CSV file that contains missing values
data = np.genfromtxt('sample-data.csv', delimiter=',', dtype='float32', skip_header=1, filling_values=np.nan)
print(data)
Explanation:
- delimiter=',' tells NumPy to expect comma-separated values.
- skip_header=1 ignores the first row (usually column headers).
- filling_values=np.nan ensures missing values are represented as np.nan (Not a Number).
Output: A 2D array where missing values are marked as nan.
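If you want to try the loader without a file on disk, genfromtxt also accepts file-like objects. Here is a minimal sketch with a few made-up values (the numbers are illustrative, not from sample-data.csv):
from io import StringIO
import numpy as np

# Small in-memory CSV with two missing fields, just for experimentation
csv_text = StringIO("a,b,c\n1.0,2.0,3.0\n4.0,,6.0\n,8.0,9.0\n")
sample = np.genfromtxt(csv_text, delimiter=',', dtype='float32',
                       skip_header=1, filling_values=np.nan)
print(sample)  # the two empty fields show up as nan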
Step 2: Identify Missing or Invalid Entries
Now that we’ve loaded the data, we need to find and quantify the missing values.
# Count missing values column-wise
missing_per_column = np.isnan(data).sum(axis=0)
print("Missing values per column:", missing_per_column)
This gives you a quick overview of where the data needs attention.
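It can also help to look at missing values as a fraction of each column, so you can decide whether a column is worth keeping at all. A short sketch (the 50% cut-off is just an illustrative threshold, not a rule):
# Fraction of missing values per column (mean of a boolean mask)
missing_fraction = np.isnan(data).mean(axis=0)
print("Fraction missing per column:", missing_fraction)

# Flag columns where more than half the values are missing
print("Heavily missing columns:", np.where(missing_fraction > 0.5)[0])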
Step 3: Handle Missing Data
There are multiple strategies to handle missing data:
Option 1: Remove Rows with Missing Data
cleaned_data = data[~np.isnan(data).any(axis=1)]
print(cleaned_data)
This line filters out any row that has even a single missing value. It’s simple — but risky if too much data is lost.
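Before committing to this option, it is worth measuring how much data the drop actually costs, for example:
# Compare row counts before and after dropping incomplete rows
rows_before = data.shape[0]
rows_after = cleaned_data.shape[0]
dropped = rows_before - rows_after
print(f"Dropped {dropped} of {rows_before} rows ({dropped / rows_before:.1%})")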
Option 2: Replace Missing Data with Column Means
# Replace NaN with column mean
col_mean = np.nanmean(data, axis=0)
indices = np.where(np.isnan(data))
data[indices] = np.take(col_mean, indices[1])
print(data)
Verification:
print("Any missing values left?", np.isnan(data).any())
Output: False, which confirms we’ve replaced all missing values.
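If your columns contain outliers, the mean itself can be skewed by them. A common alternative is to impute with the column median instead; the same pattern works with np.nanmedian (a sketch to run instead of, not after, the mean-based replacement above, while the NaNs are still present):
# Replace NaN with the column median, which is more robust to outliers
col_median = np.nanmedian(data, axis=0)
indices = np.where(np.isnan(data))
data[indices] = np.take(col_median, indices[1])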
Step 4: Normalize or Scale the Data
Scaling brings values into a uniform range, which is important for ML and statistical modeling.
# Min-Max normalization
data_min = data.min(axis=0)
data_max = data.max(axis=0)
normalized_data = (data - data_min) / (data_max - data_min)
print(normalized_data)
All values are now between 0 and 1. This removes bias caused by different units or scales across columns.
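One edge case to watch for: if a column is constant, data_max - data_min is zero and the division produces NaN (with a runtime warning). A small defensive sketch:
# Guard against constant columns, where max == min would divide by zero
col_range = data_max - data_min
col_range[col_range == 0] = 1.0  # constant columns map to all zeros instead of NaN
normalized_data = (data - data_min) / col_range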
Step 5: Data Type Conversion
Often, you’ll need to cast values to a consistent data type, especially when preparing inputs for a model.
converted_data = data.astype('float64')
print("New dtype:", converted_data.dtype)
Always verify the dtype after conversion. Inconsistent types can lead to silent bugs or runtime errors.
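As a rough illustration, widening the dtype doubles the memory footprint, while narrowing can silently discard information. A quick sketch of both effects:
# Widening float32 -> float64 doubles the memory used by the array
print("float32 bytes:", data.nbytes)
print("float64 bytes:", converted_data.nbytes)

# Narrowing is riskier: casting to an integer type truncates the decimals
truncated = converted_data.astype('int32')
print("Information lost in the cast?", not np.allclose(truncated, converted_data))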
Final Validation Checks
- Shape check: data.shape should match your expectations (rows x columns).
- NaN check: np.isnan(data).any() should be False.
- Range check: np.min(normalized_data), np.max(normalized_data) should fall within expected bounds after scaling.
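These checks can be bundled into a few assertions at the end of a preprocessing script, so they fail loudly if something slipped through. A sketch, where expected_shape is a placeholder you would set for your own file:
# Run the validation checks as assertions; expected_shape is a placeholder
expected_shape = (100, 4)  # set this to the dimensions you expect
assert data.shape == expected_shape, "Unexpected shape"
assert not np.isnan(data).any(), "NaNs remain after cleaning"
assert normalized_data.min() >= 0.0 and normalized_data.max() <= 1.0, "Scaled values outside [0, 1]"
print("All validation checks passed.")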
Conclusion
Data preprocessing is where the real work begins. Clean, consistent data is the foundation of reliable analysis and modeling. With NumPy’s fast and expressive syntax, you can wrangle messy datasets into structured form — ready for insights.
In the next lesson, we’ll explore how to work with structured CSV data, and even pair NumPy arrays with Pandas DataFrames.