Why Do We Need to Encode Categorical Data?
In machine learning, most algorithms operate on numerical data. However, real-world datasets often contain categorical features such as country names, gender, or product categories.
Since ML models can't process strings directly, we must convert categorical data into numbers. This process is called encoding.
🧠 Question:
Why can't we just feed "Red", "Green", "Blue" as-is to a machine learning model?
Answer:
Because ML models understand only numbers. If we don't convert these values, the model can't perform the arithmetic it relies on, such as computing distances or finding split points, so it won't learn patterns effectively.
Common Encoding Techniques
1. Label Encoding
This method converts each category into a unique integer value. Useful when the categorical variable has a natural order (ordinal data).
Example:
Suppose we have a "Size" column:
| Size   |
|--------|
| Small  |
| Medium |
| Large  |
Using label encoding:
- Small → 0
- Medium → 1
- Large → 2
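As a minimal sketch, ordinal encoding can be done with a hand-written mapping in pandas. (A caveat worth knowing: scikit-learn's `LabelEncoder` assigns integers alphabetically, which would give Large → 0, Medium → 1, Small → 2, so an explicit mapping is the safer way to preserve a custom order.)

```python
import pandas as pd

# Ordered mapping chosen by hand so the integers reflect the actual size order
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium']})
df['Size_encoded'] = df['Size'].map(size_order)
print(df)
```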
🧠 Question:
Is this encoding suitable for colors like Red, Blue, Green?
❌ Answer:
No, because colors have no inherent order. Encoding them like this might mislead the model to think Green > Blue > Red, which is wrong.
2. One-Hot Encoding
This method creates separate binary (0/1) columns for each category. Ideal for nominal data (no order).
Example:
For a "Color" column with values Red, Green, Blue:
Original:

| Color |
|-------|
| Red   |
| Green |
| Blue  |
| Green |

One-Hot Encoded:

| Red | Green | Blue |
|-----|-------|------|
| 1   | 0     | 0    |
| 0   | 1     | 0    |
| 0   | 0     | 1    |
| 0   | 1     | 0    |
This avoids introducing false ordering.
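A quick way to try this yourself is `pd.get_dummies`, pandas' lightweight equivalent of one-hot encoding (the scikit-learn route is shown later in this section):

```python
import pandas as pd

colors = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green']})
# get_dummies creates one 0/1 column per distinct category
one_hot = pd.get_dummies(colors['Color'])
print(one_hot)
```

Note that the columns come out in alphabetical order (Blue, Green, Red), which is harmless here because no ordering is implied either way.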
Python Code: Encoding Categorical Data
Sample Dataset
Let’s say we have a dataset of people with their country and gender:
| Name    | Country | Gender |
|---------|---------|--------|
| Alice   | India   | Female |
| Bob     | USA     | Male   |
| Charlie | UK      | Male   |
| Diana   | India   | Female |
Goal:
Convert 'Country' and 'Gender' columns into numeric format using Label Encoding and One-Hot Encoding.
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Country': ['India', 'USA', 'UK', 'India'],
    'Gender': ['Female', 'Male', 'Male', 'Female']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Label Encoding Gender (since it's binary, label encoding is fine)
label_encoder = LabelEncoder()
df['Gender_encoded'] = label_encoder.fit_transform(df['Gender'])

# One-Hot Encoding Country
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Country'])],
    remainder='passthrough'
)
df_encoded = ct.fit_transform(df[['Country', 'Gender_encoded']])

# The encoder may return a sparse matrix; convert to dense if needed
encoded_df = pd.DataFrame(
    df_encoded.toarray() if hasattr(df_encoded, 'toarray') else df_encoded
)
print("\nAfter Encoding:")
print(encoded_df)
```
Explanation of Code
- Pandas is used to create and manipulate the dataframe.
- LabelEncoder is applied to the 'Gender' column; it assigns integers in alphabetical order (Female → 0, Male → 1).
- OneHotEncoder is applied to the 'Country' column to avoid false ordinal relationships.
- ColumnTransformer is used to apply encoding only on selected columns.
🧠 Question:
Why do we use both LabelEncoder and OneHotEncoder?
Answer:
We use LabelEncoder for binary features (like Gender) where one column is enough. For non-binary categorical features like Country, One-Hot avoids implying any order.
Key Takeaways
- Always encode categorical variables before feeding into ML models.
- Use Label Encoding only when the categories have an order.
- Use One-Hot Encoding when categories are unordered (nominal).
Now You Try:
Try encoding the following data using One-Hot Encoding:
| Fruit  |
|--------|
| Apple  |
| Banana |
| Orange |
| Banana |
✍️ Your Turn (Answer):
| Apple | Banana | Orange |
|-------|--------|--------|
| 1     | 0      | 0      |
| 0     | 1      | 0      |
| 0     | 0      | 1      |
| 0     | 1      | 0      |
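You can check your answer with the same `pd.get_dummies` approach shown earlier:

```python
import pandas as pd

fruit = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Orange', 'Banana']})
# One 0/1 column per fruit; astype(int) turns the boolean dummies into 0/1
one_hot = pd.get_dummies(fruit['Fruit']).astype(int)
print(one_hot)
```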