Machine Learning for Beginners

Encoding Categorical Data in Machine Learning (Beginner-Friendly Guide)



Why Do We Need to Encode Categorical Data?

In Machine Learning, most algorithms work only with numerical data. However, real-world datasets often contain categorical features such as country names, gender, product categories, etc.

Since most ML models can't process strings directly, we must convert categorical data into numbers. This process is called encoding.

🧠 Question:

Why can't we just feed "Red", "Green", "Blue" as-is to a machine learning model?

Answer:

Because most ML algorithms are built on arithmetic, such as distances and weighted sums. A string like "Red" has no numeric value, so without encoding the model cannot compute distances or find patterns in that feature.
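To make the point above concrete, here is a minimal sketch, assuming NumPy is installed. The helper name try_numeric is hypothetical and exists only for this illustration. It shows that category names cannot be converted to numbers automatically.

```python
import numpy as np

# Raw category names, exactly as they might appear in a dataset column.
colors = np.array(["Red", "Green", "Blue"])

def try_numeric(values):
    """Return the values as floats, or None if they cannot be converted."""
    try:
        return values.astype(float)
    except ValueError:
        # "Red" has no numeric meaning, so the conversion fails.
        return None

as_numbers = try_numeric(colors)
print(as_numbers)  # None
```

The conversion fails because there is no built-in rule that turns a name into a number. Encoding is how we supply that rule ourselves.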


Common Encoding Techniques

1. Label Encoding

This method converts each category into a unique integer value. It is most useful when the categorical variable has a natural order (ordinal data) and the integers are assigned in that same order.

Example:

Suppose we have a "Size" column:

Size
----
Small
Medium
Large

Using label encoding:

Small  → 0
Medium → 1
Large  → 2
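The mapping above can be sketched in Python with a plain dictionary, assuming pandas is installed. The names df_size and size_order are chosen for this illustration. Writing the order out by hand matters here: scikit-learn's LabelEncoder assigns integers in alphabetical order, which would put Large before Small.

```python
import pandas as pd

df_size = pd.DataFrame({"Size": ["Small", "Medium", "Large", "Medium"]})

# Explicit mapping that preserves the natural order Small < Medium < Large.
size_order = {"Small": 0, "Medium": 1, "Large": 2}

df_size["Size_encoded"] = df_size["Size"].map(size_order)
print(df_size)
```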

🧠 Question:

Is this encoding suitable for colors like Red, Blue, Green?

❌ Answer:

No. Colors have no inherent order, so integer codes would suggest an arbitrary ranking such as Green > Blue > Red, and a model may treat that ranking as meaningful.
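As a small sketch of this problem, assuming scikit-learn is installed, the snippet below label-encodes three colors. LabelEncoder assigns integers in alphabetical order, so the ranking it produces is just as arbitrary as any other.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Blue", "Green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# Which integer each color received (alphabetical order).
mapping = {str(name): index for index, name in enumerate(encoder.classes_)}
print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}

# The codes now imply that Red is "further" from Blue than Green is,
# even though no such relationship exists between colors.
red_to_blue = abs(int(codes[0]) - int(codes[1]))
green_to_blue = abs(int(codes[2]) - int(codes[1]))
print(red_to_blue, green_to_blue)  # 2 1
```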


2. One-Hot Encoding

This method creates one binary (0/1) column per category. Each row gets a 1 in the column for its category and 0 in the others. It is ideal for nominal data (no order).

Example:

For a "Color" column with values Red, Green, Blue:

Original:
Color
-----
Red
Green
Blue
Green

One-Hot Encoded:
Red  Green  Blue
1     0      0
0     1      0
0     0      1
0     1      0

This avoids introducing false ordering.
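The table above can be reproduced with pandas, as a minimal sketch. pd.get_dummies sorts the new columns alphabetically, so the columns are reordered here only to match the table.

```python
import pandas as pd

df_color = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One 0/1 column per distinct color.
one_hot = pd.get_dummies(df_color["Color"], dtype=int)

# Reorder columns to match the table in the text (default is alphabetical).
one_hot = one_hot[["Red", "Green", "Blue"]]
print(one_hot)
```

Every row contains exactly one 1, so no color is numerically larger or smaller than another.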


Python Code: Encoding Categorical Data

Sample Dataset

Let’s say we have a dataset of people with their country and gender:

Name     Country  Gender
-------  -------  ------
Alice    India    Female
Bob      USA      Male
Charlie  UK       Male
Diana    India    Female

Goal:

Convert 'Country' and 'Gender' columns into numeric format using Label Encoding and One-Hot Encoding.

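import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Country': ['India', 'USA', 'UK', 'India'],
    'Gender': ['Female', 'Male', 'Male', 'Female']
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Label-encode Gender: it has only two values, so a single 0/1 column is enough.
# LabelEncoder assigns integers in alphabetical order: Female -> 0, Male -> 1.
label_encoder = LabelEncoder()
df['Gender_encoded'] = label_encoder.fit_transform(df['Gender'])

# One-hot encode Country and pass Gender_encoded through unchanged.
# sparse_threshold=0 makes the transformer always return a dense NumPy array.
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Country'])],
    remainder='passthrough',
    sparse_threshold=0
)

encoded_array = ct.fit_transform(df[['Country', 'Gender_encoded']])

# Name the columns: one per country (in the encoder's sorted order), then Gender_encoded.
country_categories = ct.named_transformers_['encoder'].categories_[0]
country_columns = ['Country_' + str(c) for c in country_categories]
encoded_df = pd.DataFrame(encoded_array, columns=country_columns + ['Gender_encoded'])

print("\nAfter Encoding:")
print(encoded_df)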

Explanation of Code

🧠 Question:

Why do we use both LabelEncoder and OneHotEncoder?

Answer:

We use LabelEncoder for binary features (like Gender), where a single 0/1 column carries all the information. For features with more than two unordered categories, like Country, one-hot encoding avoids implying any order.
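As a side note on the binary case, here is a hedged sketch of an alternative that is not used in the code above: scikit-learn's OneHotEncoder accepts drop="if_binary", which keeps a single column for a two-valued feature. For Gender it yields the same 0/1 column as label encoding.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

genders = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female"]})

# drop="if_binary" removes the first category's column when a feature
# has exactly two categories, leaving one 0/1 column.
binary_encoder = OneHotEncoder(drop="if_binary")
gender_encoded = binary_encoder.fit_transform(genders).toarray()

print(gender_encoded.ravel())  # Female -> 0.0, Male -> 1.0
```

This keeps all categorical features on one encoder, which some readers may find simpler than mixing two encoder classes.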


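Key Takeaways

Most ML algorithms need numeric input, so categorical features must be encoded first.
Label encoding suits ordinal data, such as Small < Medium < Large, and binary features.
One-hot encoding suits nominal data, such as colors or countries, because it introduces no false ordering.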

Now You Try:

Try encoding the following data using One-Hot Encoding:

Fruit
-----
Apple
Banana
Orange
Banana

✍️ Your Turn (Answer):

Apple  Banana  Orange
1       0       0
0       1       0
0       0       1
0       1       0
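To check the answer above in code, here is a minimal sketch using scikit-learn's OneHotEncoder. The variable names are chosen for this illustration. The encoder orders its columns alphabetically, which here happens to match the table.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

fruits = pd.DataFrame({"Fruit": ["Apple", "Banana", "Orange", "Banana"]})

fruit_encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it to a dense integer array.
fruit_matrix = fruit_encoder.fit_transform(fruits).toarray().astype(int)

# Column order used by the encoder.
fruit_columns = [str(name) for name in fruit_encoder.categories_[0]]

print(fruit_columns)
print(fruit_matrix)
```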

