Project: Customer Segmentation Using Clustering
Customer segmentation is a fundamental marketing strategy that divides a customer base into groups of individuals with similar characteristics. In this project, we'll use K-Means clustering, a popular unsupervised machine learning algorithm, to segment customers based on their behavior.
---
Real-Life Problem Statement
A retail store wants to segment its customers to offer personalized marketing. They have a dataset of customer details like:
- Annual Income
- Spending Score (a score given based on shopping behavior)
By clustering customers, we can group them into categories like:
- High Income, High Spend
- Low Income, High Spend
- High Income, Low Spend
- Low Income, Low Spend
What is Clustering?
Clustering is an unsupervised learning technique where the goal is to group similar data points together.
K-Means Clustering partitions the dataset into k clusters, where each data point belongs to the cluster with the nearest mean.
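To make "the cluster with the nearest mean" concrete, here is a minimal NumPy sketch of the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function name `kmeans_sketch` and the toy data are illustrative assumptions; the rest of this project relies on scikit-learn's `KMeans` rather than this hand-written loop.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    """Plain K-Means loop: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: squared distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return labels, centroids

# Toy data (made up): two well-separated groups of (income, spending score) pairs.
toy = np.array([[15, 80], [16, 82], [17, 78],
                [85, 15], [88, 12], [90, 18]], dtype=float)
toy_labels, toy_centroids = kmeans_sketch(toy, k=2)
print(toy_labels)
print(toy_centroids)
```

The first three toy points should share one label and the last three the other; which group is called 0 and which is called 1 depends on the random starting centroids.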
What kind of learning is this?
It’s unsupervised learning, because we don’t have predefined labels. We’re discovering structure from data.
Why use clustering instead of classification?
Because we don’t have labeled output like “Customer Type A” or “Customer Type B”. We let the algorithm find natural groupings.
---
Dataset Used
We’ll use the Mall_Customers.csv dataset, which contains the following columns:
- CustomerID
- Gender
- Age
- Annual Income (k$)
- Spending Score (1-100)
Step-by-Step Code with Explanations
# Step 1: Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Step 2: Load the dataset
df = pd.read_csv('Mall_Customers.csv')
# Step 3: Select features for clustering
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
Shape of X: (200, 2)
We choose just two features to visualize the clusters easily: Annual Income and Spending Score.
---
How to choose the number of clusters (k)?
We use the Elbow Method: plot the Within-Cluster Sum of Squares (WCSS) for different values of k. The “elbow point”, where the curve stops dropping sharply and starts to flatten, suggests a reasonable number of clusters.
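WCSS is the sum of squared distances between each point and the centroid of the cluster it was assigned to; scikit-learn exposes this value as `inertia_` after fitting. The sketch below recomputes it by hand to show what the elbow plot is actually measuring. It uses a small synthetic array (`demo_X`, an assumption standing in for the two real features), so the numbers it prints are not results from Mall_Customers.csv.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the two features (an assumption, not the real dataset).
rng = np.random.default_rng(0)
demo_X = np.vstack([
    rng.normal(loc=(20, 20), scale=3, size=(30, 2)),
    rng.normal(loc=(80, 80), scale=3, size=(30, 2)),
    rng.normal(loc=(20, 80), scale=3, size=(30, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(demo_X)

# WCSS: squared distance of each point to the centroid of its own cluster.
own_centroids = model.cluster_centers_[model.labels_]
manual_wcss = ((demo_X - own_centroids) ** 2).sum()

# The two values should agree up to floating-point error.
print(manual_wcss, model.inertia_)
```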
# Step 4: Find the optimal number of clusters using Elbow Method
wcss = []
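for i in range(1, 11):
    # n_init=10 is set explicitly so results do not depend on the
    # scikit-learn version's default number of restarts
    kmeans = KMeans(n_clusters=i, n_init=10, random_state=42)
    kmeans.fit(X)
    # inertia_ is scikit-learn's name for WCSS
    wcss.append(kmeans.inertia_)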
# Plot the elbow graph
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()
The "elbow" typically appears at k = 5.
---
Apply K-Means with k = 5
# Step 5: Apply K-Means with k = 5
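# n_init=10 is set explicitly so the number of restarts does not depend on the scikit-learn version
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)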
y_kmeans = kmeans.fit_predict(X)
y_kmeans is an array of cluster labels (0 to 4) assigned to each customer.
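Because the labels come back in the same order as the rows, they can be stored next to the customers they describe. The sketch below illustrates this on a tiny made-up frame (`demo_df` and its six rows are assumptions for illustration, not rows from Mall_Customers.csv); the same two lines at the end would work on `df` and `y_kmeans`.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Small made-up frame with the same column names as the tutorial's dataset.
demo_df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'Annual Income (k$)': [15, 16, 17, 85, 88, 90],
    'Spending Score (1-100)': [80, 82, 78, 15, 12, 18],
})
features = demo_df[['Annual Income (k$)', 'Spending Score (1-100)']]

# fit_predict returns one label per row, in the same order as the rows.
demo_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(features)

# Store the labels next to the customers and count the members of each cluster.
demo_df['Cluster'] = demo_labels
cluster_sizes = demo_df['Cluster'].value_counts().sort_index()
print(cluster_sizes)
```

Keeping the label as a column makes later steps, such as filtering one segment or averaging its income, ordinary pandas operations.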
---
Visualize the Clusters
# Step 6: Visualize the clusters
plt.figure(figsize=(8, 5))
plt.scatter(X.iloc[y_kmeans==0, 0], X.iloc[y_kmeans==0, 1], s=100, label='Cluster 1')
plt.scatter(X.iloc[y_kmeans==1, 0], X.iloc[y_kmeans==1, 1], s=100, label='Cluster 2')
plt.scatter(X.iloc[y_kmeans==2, 0], X.iloc[y_kmeans==2, 1], s=100, label='Cluster 3')
plt.scatter(X.iloc[y_kmeans==3, 0], X.iloc[y_kmeans==3, 1], s=100, label='Cluster 4')
plt.scatter(X.iloc[y_kmeans==4, 0], X.iloc[y_kmeans==4, 1], s=100, label='Cluster 5')
# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='black', marker='X', label='Centroids')
plt.title('Customer Segments')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid(True)
plt.show()
This plot shows 5 clusters with centroids. You can visually identify:
- Low income, low spending
- High income, high spending
- Moderate income, moderate spending
- Low income, high spending (could be impulsive buyers!)
- High income, low spending (maybe potential premium buyers)
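The segment names above are read off the plot by eye. One possible way to make that step repeatable is to compare each centroid's coordinates against cut points. This is only a sketch: the helper names, the cut points (40/70 for income, 40/60 for spending score), and the example centroid coordinates are all illustrative assumptions, not values computed from the dataset.

```python
def level(value, low_cut, high_cut):
    """Map a number to 'Low', 'Moderate', or 'High' using two cut points."""
    if value < low_cut:
        return 'Low'
    if value > high_cut:
        return 'High'
    return 'Moderate'

def describe_segment(income, score, income_cuts=(40, 70), score_cuts=(40, 60)):
    # The cut points are illustrative assumptions, not values from the dataset.
    return f"{level(income, *income_cuts)} income, {level(score, *score_cuts)} spending"

# Made-up centroid coordinates (income in k$, spending score) for illustration.
example_centroids = [(25, 20), (25, 80), (55, 50), (88, 17), (87, 82)]
segment_names = [describe_segment(inc, sc) for inc, sc in example_centroids]
for name in segment_names:
    print(name)
```

In the project itself, the rows of `kmeans.cluster_centers_` would take the place of `example_centroids`, and the cut points would need to be chosen by looking at the actual income and score distributions.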
Final Summary
In this project, we:
- Loaded and prepared customer data
- Used K-Means clustering to group customers
- Chose optimal k using the elbow method
- Visualized customer segments for better marketing strategy
What can a business do with these segments?
They can create targeted campaigns: offer luxury services to high-income/high-spend customers and loyalty programs for low-income/high-spend customers.
Can we use more features in clustering?
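Absolutely! Additional features such as Age can capture richer behavior, though more features do not automatically produce better clusters. K-Means relies on distances, so features should be on comparable scales, and you’ll need a dimensionality reduction technique such as PCA to visualize more than two or three features.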
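As a rough sketch of what that could look like with three features, the example below scales the data, clusters in the full feature space, and uses PCA only to get two plotting coordinates. The data is randomly generated (an assumption standing in for Age, Annual Income, and Spending Score), and k = 5 is simply carried over from the two-feature case; with more features, the elbow plot would need to be redone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three features: age, annual income (k$), spending score.
rng = np.random.default_rng(0)
demo_features = np.column_stack([
    rng.integers(18, 70, size=200),   # age
    rng.integers(15, 140, size=200),  # annual income (k$)
    rng.integers(1, 101, size=200),   # spending score
]).astype(float)

# Scale first so no single feature dominates the Euclidean distances.
scaled = StandardScaler().fit_transform(demo_features)

# Cluster in the full 3-D feature space (k=5 is carried over as an assumption).
labels_3d = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(scaled)

# Project to 2-D only for plotting; the clustering itself used all three features.
projected = PCA(n_components=2).fit_transform(scaled)
print(projected.shape, labels_3d.shape)
```

The two columns of `projected` can then be passed to `plt.scatter` with `c=labels_3d` to color each point by its cluster.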
Congratulations! 🎉 You've built a complete unsupervised ML project from scratch!