
    Data Preprocessing with NumPy
    Clean and Prepare Real-World Data


    Introduction

    Real-world data is rarely clean. Missing values, inconsistent formats, and outliers are common. Before any analysis or modeling, we must preprocess the data — and NumPy offers powerful tools to do this efficiently.

    This tutorial will guide you step-by-step through preprocessing a dataset using NumPy. We’ll focus on practical tasks like loading data, handling missing values, converting data types, and verifying integrity.

    Step 1: Load the Data into a NumPy Array

    You can load external data using numpy.loadtxt() or numpy.genfromtxt(). The latter is especially useful when dealing with missing or incomplete values.

    import numpy as np
    
    # Load a CSV file; empty fields are read as NaN
    data = np.genfromtxt('sample-data.csv', delimiter=',', dtype='float32', skip_header=1, filling_values=np.nan)
    
    print(data)

    Explanation:

    • delimiter=',' tells NumPy to expect comma-separated values.
    • skip_header=1 ignores the first row (usually column headers).
    • filling_values=np.nan ensures missing values are represented as np.nan (Not a Number). For floating-point dtypes this is already the default, so the argument mainly documents intent.

    Output: A 2D array where missing values are marked as nan.
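
    The loading call can be tried without a file on disk. The sketch below feeds genfromtxt() an in-memory string instead of 'sample-data.csv'; the column names and values are made up for illustration and are not taken from the original dataset.

```python
import io
import numpy as np

# Hypothetical CSV contents standing in for 'sample-data.csv'.
# Empty fields represent missing values.
csv_text = """age,height,weight
25,175.0,70.5
30,,82.0
,160.5,55.0
41,180.2,
"""

data = np.genfromtxt(
    io.StringIO(csv_text),   # any file-like object works in place of a path
    delimiter=",",
    dtype="float32",
    skip_header=1,           # skip the header row
    filling_values=np.nan,   # empty fields become NaN
)

print(data.shape)  # (4, 3)
print(data)
```

    Each of the three empty fields shows up as nan in the printed array, one per column.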

    Step 2: Identify Missing or Invalid Entries

    Now that we’ve loaded the data, we need to find and quantify the missing values.

    # Count missing values column-wise
    missing_per_column = np.isnan(data).sum(axis=0)
    print("Missing values per column:", missing_per_column)

    This gives you a quick overview of where the data needs attention.
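
    The same boolean mask can also be summed along rows or averaged to get a fraction. Here is a minimal sketch on a small hypothetical array (the values are invented for the example):

```python
import numpy as np

# Hypothetical 4x3 array; NaN marks a missing value.
data = np.array([
    [25.0, 175.0, 70.5],
    [30.0, np.nan, 82.0],
    [np.nan, 160.5, 55.0],
    [41.0, 180.2, np.nan],
], dtype="float32")

mask = np.isnan(data)                  # True where a value is missing
missing_per_column = mask.sum(axis=0)  # one count per column
missing_per_row = mask.sum(axis=1)     # one count per row
missing_fraction = mask.mean(axis=0)   # share of missing values per column

print(missing_per_column)  # [1 1 1]
print(missing_per_row)     # [0 1 1 1]
print(missing_fraction)    # [0.25 0.25 0.25]
```

    The per-column fraction is often the more useful number when deciding between dropping rows and filling values.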

    Step 3: Handle Missing Data

    There are multiple strategies to handle missing data:

    Option 1: Remove Rows with Missing Data

    cleaned_data = data[~np.isnan(data).any(axis=1)]
    print(cleaned_data)

    This line keeps only the rows that contain no missing values. It is simple, but risky when many rows have at least one gap, because each of those rows is discarded entirely.
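
    One way to judge that risk is to measure how many rows the filter removes before committing to it. The sketch below uses a hypothetical array and a made-up threshold (MAX_LOSS); pick a limit that suits your own data.

```python
import numpy as np

# Hypothetical 4x3 array; NaN marks a missing value.
data = np.array([
    [25.0, 175.0, 70.5],
    [30.0, np.nan, 82.0],
    [np.nan, 160.5, 55.0],
    [41.0, 180.2, np.nan],
], dtype="float32")

complete_rows = ~np.isnan(data).any(axis=1)  # True for rows with no NaN
cleaned_data = data[complete_rows]

rows_lost = data.shape[0] - cleaned_data.shape[0]
loss_ratio = rows_lost / data.shape[0]
print(f"Dropped {rows_lost} of {data.shape[0]} rows ({loss_ratio:.0%})")

# Hypothetical threshold: treat losing more than half the rows as too costly.
MAX_LOSS = 0.5
if loss_ratio > MAX_LOSS:
    print("Too much data lost; consider filling values instead.")
```

    Here a single missing value in each of three rows costs 75% of the data, which is the situation where replacing values is usually the better choice.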

    Option 2: Replace Missing Data with Column Means

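    # Replace each NaN with the mean of its column (modifies data in place)
    col_mean = np.nanmean(data, axis=0)    # per-column mean, ignoring NaN
    indices = np.where(np.isnan(data))     # (row_indices, col_indices) of the NaNs
    data[indices] = np.take(col_mean, indices[1])  # pick the mean of each NaN's column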
    
    print(data)

    Verification:

    print("Any missing values left?", np.isnan(data).any())

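    The check prints False, which confirms that all missing values have been replaced. A column with no valid values at all would stay NaN, because its mean is undefined.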
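
    To see how the index lookup works, it helps to trace the replacement on a tiny array. The values below are hypothetical and chosen so the column means are easy to check by hand.

```python
import numpy as np

data = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

col_mean = np.nanmean(data, axis=0)    # [2.0, 15.0], NaN entries are ignored
rows, cols = np.where(np.isnan(data))  # rows = [1, 2], cols = [0, 1]

# For each NaN, look up the mean of the column it sits in.
# col_mean[cols] gives the same values as np.take(col_mean, cols).
data[rows, cols] = col_mean[cols]

print(data)
# [[ 1. 10.]
#  [ 2. 20.]
#  [ 3. 15.]]
```

    The NaN in column 0 becomes 2.0 (the mean of 1 and 3), and the NaN in column 1 becomes 15.0 (the mean of 10 and 20).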

    Step 4: Normalize or Scale the Data

    Scaling brings every column into a common range, which matters for machine learning and statistical methods that are sensitive to the magnitude of their inputs.

    # Min-Max normalization
    data_min = data.min(axis=0)
    data_max = data.max(axis=0)
    
    normalized_data = (data - data_min) / (data_max - data_min)
    print(normalized_data)

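    Each column now ranges from 0 to 1, so columns measured in different units or scales contribute comparably. Note that a constant column (data_max equal to data_min) makes the denominator zero and produces NaN, and any NaN still present in data carries through to the result.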
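
    A constant column is a common edge case for min-max scaling. The helper below is one possible guard, not part of NumPy; the name min_max_scale and the choice to map constant columns to 0 are assumptions made for this sketch.

```python
import numpy as np

def min_max_scale(arr):
    """Scale each column to [0, 1]; constant columns become all zeros."""
    col_min = arr.min(axis=0)
    col_range = arr.max(axis=0) - col_min
    # Assumption of this sketch: replace a zero range with 1 so that
    # constant columns map to 0 instead of triggering a division by zero.
    safe_range = np.where(col_range == 0, 1, col_range)
    return (arr - col_min) / safe_range

# Hypothetical data: the third column is constant.
data = np.array([
    [1.0, 100.0, 5.0],
    [2.0, 300.0, 5.0],
    [3.0, 200.0, 5.0],
])

normalized_data = min_max_scale(data)
print(normalized_data)
# [[0.  0.  0. ]
#  [0.5 1.  0. ]
#  [1.  0.5 0. ]]
```

    Mapping constant columns to 0 is only one option. Dropping such columns before scaling is equally reasonable, since they carry no information.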

    Step 5: Data Type Conversion

    Often, you’ll need to cast values to a consistent data type, especially when preparing inputs for a model.

    converted_data = data.astype('float64')
    print("New dtype:", converted_data.dtype)

    Always verify the dtype after conversion. Inconsistent types can lead to silent bugs or runtime errors.
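
    Conversions between float types are safe, but casting to an integer type changes values. A short sketch with hypothetical numbers shows the difference between truncating and rounding:

```python
import numpy as np

data = np.array([[1.4, 2.6], [3.5, 4.0]], dtype="float32")

# Widening float32 -> float64 preserves every value.
as_float64 = data.astype("float64")
print(as_float64.dtype)    # float64

# NaN and infinity have no integer representation, so check before casting.
assert np.isfinite(data).all()

# astype() to an integer type truncates toward zero.
truncated = data.astype("int64")
# np.rint() rounds to the nearest integer first (ties go to the even value).
rounded = np.rint(data).astype("int64")

print(truncated.tolist())  # [[1, 2], [3, 4]]
print(rounded.tolist())    # [[1, 3], [4, 4]]
```

    If the array may still contain NaN, fill or drop those entries before any integer cast.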

    Final Validation Checks

    • Shape check: data.shape should match your expectations (rows x columns).
    • NaN check: np.isnan(data).any() should be False.
    • Range check: np.min(normalized_data) and np.max(normalized_data) should fall within expected bounds after scaling (0 and 1 for min-max normalization).
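
    The checks above can be bundled into one small function. This is a minimal sketch: the function name validate, its parameters, and the message wording are all hypothetical, and the default bounds assume min-max scaled data.

```python
import numpy as np

def validate(arr, expected_columns, low=0.0, high=1.0):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Shape check: expect a 2D array with the given number of columns.
    if arr.ndim != 2 or arr.shape[1] != expected_columns:
        problems.append(f"unexpected shape {arr.shape}")
    # NaN check; the range check is skipped when NaN is present,
    # because min() and max() would themselves return NaN.
    if np.isnan(arr).any():
        problems.append("array contains NaN")
    elif arr.size and (arr.min() < low or arr.max() > high):
        problems.append(f"values outside [{low}, {high}]")
    return problems

good = np.array([[0.0, 0.5], [1.0, 0.25]])
bad = np.array([[0.0, np.nan], [1.0, 2.0]])

print(validate(good, expected_columns=2))  # []
print(validate(bad, expected_columns=2))   # ['array contains NaN']
```

    Returning a list of messages rather than raising on the first failure makes it easier to see every problem in one pass.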

    Conclusion

    Data preprocessing is where the real work begins. Clean, consistent data is the foundation of reliable analysis and modeling. With NumPy’s fast and expressive syntax, you can wrangle messy datasets into structured form — ready for insights.

    In the next lesson, we’ll explore how to work with structured CSV data, and even pair NumPy arrays with Pandas DataFrames.


