Customer Segmentation Using Clustering - Machine Learning Project for Beginners

Project: Customer Segmentation Using Clustering

Customer segmentation is a fundamental marketing strategy that divides a customer base into groups of individuals with similar characteristics. In this project, we'll use K-Means Clustering—a popular unsupervised machine learning algorithm—to segment customers based on their behavior.

Real-Life Problem Statement

A retail store wants to segment its customers to offer personalized marketing. They have a dataset of customer details like:

Annual Income
Spending Score (a score given based on shopping behavior)

By clustering customers, we can group them into categories like:

High Income, High Spend
Low Income, High Spend
High Income, Low Spend
Low Income, Low Spend

What is Clustering?

Clustering is an unsupervised learning technique where the goal is to group similar data points together.

K-Means Clustering partitions the dataset into k clusters, where each data point belongs to the cluster with the nearest mean.
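To make "nearest mean" concrete, here is a minimal NumPy sketch of the assign-then-update loop behind K-Means. The function name kmeans_sketch and the toy data are made up for illustration; scikit-learn's KMeans, used later in this project, is more careful about initialization and convergence.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=10, seed=0):
    """Toy K-Means: alternate between assigning points and updating means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster is empty
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Made-up (income, spending score) pairs forming two obvious groups
toy = np.array([[15, 80], [16, 78], [18, 82],
                [90, 20], [92, 18], [95, 22]], dtype=float)
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The label numbers depend on the random starting points, so only the grouping is meaningful, not which group is called 0 or 1.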

What kind of learning is this?

It’s unsupervised learning, because we don’t have predefined labels. We’re discovering structure from data.

Why use clustering instead of classification?

Because we don’t have labeled output like “Customer Type A” or “Customer Type B”. We let the algorithm find natural groupings.

Dataset Used

We’ll use the Mall_Customers.csv dataset, which contains the following columns:

CustomerID
Gender
Age
Annual Income (k$)
Spending Score (1-100)

Step-by-Step Code with Explanations

# Step 1: Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 2: Load the dataset
df = pd.read_csv('Mall_Customers.csv')

# Step 3: Select features for clustering
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

Shape of X: (200, 2), i.e. 200 customers and 2 features.

We choose just two features to visualize the clusters easily: Annual Income and Spending Score.

How to choose the number of clusters (k)?

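We use the Elbow Method: fit K-Means for a range of k values and plot the Within-Cluster Sum of Squares (WCSS), the sum of squared distances from each point to its cluster's centroid. WCSS always decreases as k grows, so we look for the "elbow point": the k after which adding more clusters gives only small reductions. That k is a reasonable choice for the number of clusters.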
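As a quick check of what WCSS measures, the sketch below computes it by hand and compares it with the inertia_ attribute of a fitted KMeans model. The data is random and made up, standing in for the real customer features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up stand-in data: 60 points with 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Look up, for every point, the centroid of the cluster it was assigned to
own_centroids = model.cluster_centers_[model.labels_]

# WCSS: squared distance from each point to its own centroid, summed
wcss_manual = float(((X_demo - own_centroids) ** 2).sum())

print(wcss_manual, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding, which is why the elbow code can read inertia_ directly.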

# Step 4: Find the optimal number of clusters using Elbow Method
wcss = []
for i in range(1, 11):
    # n_init is set explicitly so results are consistent across scikit-learn versions
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the elbow graph
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

For this dataset, the "elbow" typically appears at k = 5.

Apply K-Means with k = 5

# Step 5: Apply K-Means with k = 5
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

y_kmeans is an array of cluster labels (0 to 4) assigned to each customer.
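Once every customer has a label, a per-cluster summary is often more useful than the raw array. The sketch below uses randomly generated stand-in data with the same column names as Mall_Customers.csv (an assumption made so the example runs on its own); with the real data you would use df and y_kmeans instead.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for the two selected columns of Mall_Customers.csv
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Annual Income (k$)': rng.integers(15, 138, size=200),
    'Spending Score (1-100)': rng.integers(1, 100, size=200),
})

labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(demo)

# Attach the labels, then count customers and average each feature per cluster
summary = (
    demo.assign(Cluster=labels)
        .groupby('Cluster')
        .agg(customers=('Annual Income (k$)', 'size'),
             mean_income=('Annual Income (k$)', 'mean'),
             mean_score=('Spending Score (1-100)', 'mean'))
)
print(summary)
```

Reading the mean income and mean score of each row is a simple way to decide which cluster deserves a name like "high income, low spend".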

Visualize the Clusters

# Step 6: Visualize the clusters
plt.figure(figsize=(8, 5))
# Draw each cluster in turn; labels run 0 to 4, so the legend shows Cluster 1 to 5
for cluster in range(5):
    plt.scatter(X.iloc[y_kmeans == cluster, 0], X.iloc[y_kmeans == cluster, 1],
                s=100, label=f'Cluster {cluster + 1}')

# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], 
            s=300, c='black', marker='X', label='Centroids')

plt.title('Customer Segments')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid(True)
plt.show()

This plot shows the 5 clusters with their centroids. The cluster numbers themselves are arbitrary, but you can visually identify these segments:

Low income, low spending
High income, high spending
Moderate income, moderate spending
Low income, high spending (could be impulsive buyers!)
High income, low spending (maybe potential premium buyers)

Final Summary

In this project, we:

Loaded and prepared customer data
Used K-Means clustering to group customers
Chose optimal k using the elbow method
Visualized customer segments for better marketing strategy

What can a business do with these segments?

They can create targeted campaigns: offer luxury services to high-income/high-spend customers and loyalty programs for low-income/high-spend customers.
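To target a campaign, a business also needs to place new customers into an existing segment. Here is a minimal sketch using the predict method of a fitted KMeans model. The training data and the new customer (120 k$ income, spending score 85) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

cols = ['Annual Income (k$)', 'Spending Score (1-100)']

# Made-up stand-in for the customer data
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    cols[0]: rng.integers(15, 138, size=200),
    cols[1]: rng.integers(1, 100, size=200),
})
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(demo)

# Hypothetical new customer, using the same column names as the training data
new_customer = pd.DataFrame([[120, 85]], columns=cols)

# predict returns the index of the nearest centroid
segment = int(model.predict(new_customer)[0])
print(segment, model.cluster_centers_[segment])
```

The printed centroid shows the typical income and spending score of the segment the new customer falls into.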

Can we use more features in clustering?

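Absolutely! Adding features such as Age can capture more of a customer's behavior, although more features do not automatically give better clusters. Because K-Means relies on distances, features on different scales should be standardized first, and with more than two features you'll need a dimensionality reduction technique such as PCA to visualize the result.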
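A rough sketch of that workflow with three features is shown below. The data is randomly generated with the dataset's column names, and k = 5 is simply carried over from the two-feature case rather than re-tuned, so treat both as assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data with three of the dataset's columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 71, size=200),
    'Annual Income (k$)': rng.integers(15, 138, size=200),
    'Spending Score (1-100)': rng.integers(1, 100, size=200),
})

# Put all features on a comparable scale so none dominates the distances
scaled = StandardScaler().fit_transform(demo)

# Cluster in the full 3-feature space
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(scaled)

# Project to 2 dimensions purely for plotting
coords = PCA(n_components=2).fit_transform(scaled)
print(coords.shape, np.bincount(labels))
```

The two columns of coords can be passed to plt.scatter in place of income and spending score, colored by labels.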

Congratulations! You've built a complete unsupervised ML project from scratch!
