You can support this website with a contribution of your choice.
When making a contribution, mention your name, and programguru.org in the message. Your name shall be displayed in the sponsors list.
Feature scaling is a technique used to bring all features in your dataset onto a similar scale, especially when features have different units (e.g., age in years, salary in dollars).
Imagine you're comparing two car features to predict fuel efficiency:
Many machine learning algorithms calculate distances (e.g., K-Nearest Neighbors, SVM, KMeans) or assume features are on a similar scale (e.g., Logistic Regression, Gradient Descent). Without scaling:
Why not just leave the features as they are? After all, larger numbers don't always mean more importance, right?
True, but algorithms cannot distinguish between value magnitude due to importance vs. due to units. Scaling neutralizes this issue by normalizing magnitudes, not relationships.
Transforms values to a fixed range, usually [0, 1].
Formula:
X_scaled = (X - X_min) / (X_max - X_min)
Use case: When you know data distribution is not Gaussian and you want a bounded scale (e.g., Neural Networks often prefer 0–1 range).
Height (cm): [160, 170, 180] Min = 160, Max = 180 => Normalized values: (160-160)/(180-160) = 0 (170-160)/(180-160) = 0.5 (180-160)/(180-160) = 1
Transforms data to have zero mean and unit variance.
Formula:
Z = (X - mean) / std_dev
Use case: Works well when the data follows a Gaussian distribution (bell curve). Recommended for algorithms like Logistic Regression, SVM, and PCA.
Scores: [50, 60, 70] Mean = 60, Std Dev = 10 => Z-scores: (50-60)/10 = -1 (60-60)/10 = 0 (70-60)/10 = +1
Let's apply both MinMaxScaler and StandardScaler on sample data using scikit-learn.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd
# Sample dataset
data = {
'Age': [25, 32, 47, 51, 62],
'Salary': [40000, 50000, 62000, 70000, 80000]
}
df = pd.DataFrame(data)
# 1. Normalization (MinMax Scaling)
minmax_scaler = MinMaxScaler()
df_minmax = pd.DataFrame(minmax_scaler.fit_transform(df), columns=df.columns)
# 2. Standardization (Z-score Scaling)
standard_scaler = StandardScaler()
df_standard = pd.DataFrame(standard_scaler.fit_transform(df), columns=df.columns)
print("Original Data:")
print(df)
print("\nMin-Max Scaled:")
print(df_minmax)
print("\nStandard Scaled:")
print(df_standard)
Original Data: Age Salary 0 25 40000 1 32 50000 2 47 62000 3 51 70000 4 62 80000 Min-Max Scaled: Age Salary 0 0.000000 0.000000 1 0.194444 0.222222 2 0.611111 0.488889 3 0.722222 0.666667 4 1.000000 1.000000 Standard Scaled: Age Salary 0 -1.411972 -1.414214 1 -0.806046 -0.707107 2 0.237581 0.000000 3 0.649384 0.565686 4 1.331053 1.555734
fit_transform()
to compute statistics and apply scaling at once.You can support this website with a contribution of your choice.
When making a contribution, mention your name, and programguru.org in the message. Your name shall be displayed in the sponsors list.