Machine Learning for Beginners

Machine Learning - Feature Scaling and Normalization



What is Feature Scaling in Machine Learning?

Feature scaling is a technique used to bring all features in your dataset onto a similar scale, especially when features have different units (e.g., age in years, salary in dollars).

Real-life Analogy

Imagine you're comparing two car features to predict fuel efficiency: say, Engine Size (in cc, values in the thousands) and Number of Cylinders (values below 10).

Without scaling, models like KNN or Gradient Descent may consider Engine Size more important just because it has larger values, even if it's not actually more impactful.

Why do we need Feature Scaling?

Many machine learning algorithms calculate distances (e.g., K-Nearest Neighbors, SVM, KMeans) or assume features are on a similar scale (e.g., Logistic Regression, Gradient Descent). Without scaling:

  - Features with large numeric ranges dominate distance calculations, drowning out smaller-range features.
  - Gradient Descent converges slowly or unevenly because the loss surface is stretched along the large-scale feature.
  - Model coefficients become hard to compare across features.

🧠 Question:

Why not just leave the features as they are? After all, larger numbers don't always mean more importance, right?

Answer:

True, but an algorithm cannot tell whether a value is large because the feature is genuinely important or simply because of the unit it is measured in. Scaling neutralizes this issue by normalizing magnitudes while preserving the relationships between values.
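To see this effect numerically, here is a minimal sketch (the car values are made up for illustration) comparing Euclidean distances before and after scaling:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical cars: [Engine Size (cc), Number of Cylinders]
cars = np.array([
    [1000.0, 4.0],
    [3000.0, 4.0],
    [1000.0, 8.0],
])

# Raw distances: Engine Size dominates because its values are far larger
print(np.linalg.norm(cars[0] - cars[1]))  # 2000.0 (engine sizes differ)
print(np.linalg.norm(cars[0] - cars[2]))  # 4.0    (cylinder counts differ)

# After Min-Max scaling, both differences contribute on the same [0, 1] scale
scaled = MinMaxScaler().fit_transform(cars)
print(np.linalg.norm(scaled[0] - scaled[1]))  # 1.0
print(np.linalg.norm(scaled[0] - scaled[2]))  # 1.0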


⚙️ Two Common Techniques: Normalization vs Standardization

1. Normalization (Min-Max Scaling)

Transforms values to a fixed range, usually [0, 1].
Formula: X_scaled = (X - X_min) / (X_max - X_min)

Use case: When you know the data distribution is not Gaussian and you want a bounded scale (e.g., Neural Networks often prefer inputs in the 0–1 range).

Example:

Height (cm): [160, 170, 180]
Min = 160, Max = 180

=> Normalized values:
(160-160)/(180-160) = 0
(170-160)/(180-160) = 0.5
(180-160)/(180-160) = 1
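
A quick way to check the arithmetic above is to apply the formula directly; a minimal sketch in plain Python:

heights = [160, 170, 180]
x_min, x_max = min(heights), max(heights)

# Apply X_scaled = (X - X_min) / (X_max - X_min) to each value
normalized = [(x - x_min) / (x_max - x_min) for x in heights]
print(normalized)  # [0.0, 0.5, 1.0]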

2. Standardization (Z-score Scaling)

Transforms data to have zero mean and unit variance.
Formula: Z = (X - mean) / std_dev

Use case: Works well when the data follows a Gaussian distribution (bell curve). Recommended for algorithms like Logistic Regression, SVM, and PCA.

Example:

Scores: [50, 60, 70]
Mean = 60, Std Dev = 10 (sample standard deviation, dividing by n - 1)

=> Z-scores:
(50-60)/10 = -1
(60-60)/10 =  0
(70-60)/10 = +1
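
The same arithmetic in NumPy, as a minimal sketch. Note that ddof=1 gives the sample standard deviation (10) used above; scikit-learn's StandardScaler instead divides by the population standard deviation (ddof=0), so its results differ slightly:

import numpy as np

scores = np.array([50, 60, 70])
mean = scores.mean()       # 60.0
std = scores.std(ddof=1)   # 10.0 (sample standard deviation)

# Apply Z = (X - mean) / std_dev to each value
z_scores = (scores - mean) / std
print(z_scores)  # [-1.  0.  1.]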

Python Example: Scaling Real Data

Let's apply both MinMaxScaler and StandardScaler on sample data using scikit-learn.

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

# Sample dataset
data = {
    'Age': [25, 32, 47, 51, 62],
    'Salary': [40000, 50000, 62000, 70000, 80000]
}

df = pd.DataFrame(data)

# 1. Normalization (MinMax Scaling)
minmax_scaler = MinMaxScaler()
df_minmax = pd.DataFrame(minmax_scaler.fit_transform(df), columns=df.columns)

# 2. Standardization (Z-score Scaling)
standard_scaler = StandardScaler()
df_standard = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)

print("Original Data:")
print(df)
print("\nMin-Max Scaled:")
print(df_minmax)
print("\nStandard Scaled:")
print(df_standard)

Output:

Original Data:
   Age  Salary
0   25   40000
1   32   50000
2   47   62000
3   51   70000
4   62   80000

Min-Max Scaled:
        Age  Salary
0  0.000000    0.00
1  0.189189    0.25
2  0.594595    0.55
3  0.702703    0.75
4  1.000000    1.00

Standard Scaled:
        Age    Salary
0 -1.382872 -1.440195
1 -0.856780 -0.734217
2  0.270562  0.112956
3  0.571186  0.677739
4  1.397904  1.383717

Code Explanation

  - fit_transform() learns the scaling parameters from the data (min and max for MinMaxScaler, mean and standard deviation for StandardScaler) and applies the transformation in one step.
  - The scaled arrays are wrapped back into DataFrames so the column names are preserved.
  - StandardScaler divides by the population standard deviation, so hand calculations using the sample formula will differ slightly.

Which scaling method should I use?

  - Use Normalization (MinMaxScaler) when you need a bounded range, for example as input to a Neural Network, and the data is not Gaussian.
  - Use Standardization (StandardScaler) when the data is roughly Gaussian or when using algorithms like Logistic Regression, SVM, or PCA.
  - When in doubt, try both and compare model performance on a validation set.

Summary

  - Feature scaling brings features with different units onto a comparable scale.
  - Normalization (Min-Max) rescales values to a bounded range, usually [0, 1].
  - Standardization (Z-score) centers values at mean 0 with unit variance.
  - Distance-based and gradient-based algorithms benefit most from scaling.
  - Fit the scaler on training data only, then use it to transform both training and test sets.

Quick Recap Quiz

  1. Why does feature scaling help KNN or SVM?
  2. What is the difference between normalization and standardization?
  3. What happens if you apply different scalers on training and test data?

📝 Answers:

  1. These models use distance metrics; scaling ensures fair contribution from all features.
  2. Normalization scales data to [0,1]; Standardization shifts it to zero mean, unit variance.
  3. The test set ends up on a different scale than the training set, so the model sees inconsistent patterns and likely performs poorly. Fit the scaler on the training data only and reuse it to transform the test data; fitting on test data causes data leakage. See the sketch below.
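
To make answer 3 concrete, here is a minimal sketch (the split size and data are illustrative) of the correct pattern: fit the scaler on the training set only, then reuse it for the test set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: [Age, Salary]
X = np.array([
    [25, 40000],
    [32, 50000],
    [47, 62000],
    [51, 70000],
    [62, 80000],
])

X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same parameters; never refit here

print(X_train_scaled)
print(X_test_scaled)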

