K-Means Clustering in Machine Learning
K-Means Clustering is an unsupervised learning algorithm that groups similar data points into k clusters. Unlike supervised learning, K-Means doesn't need labeled data; it finds natural groupings in the dataset based on feature similarity.
Real-Life Analogy
Imagine you run a shopping mall and you want to segment customers into groups based on their shopping habits — like "budget shoppers", "premium buyers", and "window shoppers". You don’t know how many categories there are, but you want the data to tell you. That’s where K-Means helps — it finds these groupings for you.
How K-Means Works: Step-by-Step
1. Choose the number of clusters, k.
2. Randomly initialize k centroids (cluster centers).
3. Assign each data point to the nearest centroid.
4. Recalculate each centroid as the mean of the points in its cluster.
5. Repeat steps 3–4 until the centroids do not change (or change very little).
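The steps above can be sketched in plain NumPy. This is a toy illustration, not the scikit-learn implementation: it initializes centroids at random data points and skips details such as empty-cluster handling and smarter initialization.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; points 0-2 and 3-5 end up in different clusters
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = kmeans_sketch(X, 2)
```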
Example: Clustering 2D Data
Let’s say we have data points representing the income and spending score of mall customers. We want to group similar customers together.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Sample data: [Annual Income, Spending Score]
data = np.array([
[15, 39], [15, 81], [16, 6], [16, 77], [17, 40],
[17, 76], [18, 6], [18, 94], [19, 3], [19, 72],
[20, 14], [20, 99], [21, 15], [21, 77], [23, 35],
[23, 98], [24, 35], [24, 73], [25, 5], [25, 73]
])
# Apply KMeans with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(data)
# Get cluster centers and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
# Plot the result
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.title('Customer Segmentation using K-Means')
plt.legend()
plt.grid(True)
plt.show()
A scatter plot will be shown with 3 colored clusters and red 'X' marks as centroids.
🧠 Intuition Questions
Why do we need to specify the number of clusters k?
➤ K-Means cannot infer the number of groupings on its own, so we must tell it. However, we can try different values of k and evaluate which one works best using the Elbow Method.
How are initial centroids chosen?
➤ Randomly by default (scikit-learn actually uses the smarter k-means++ scheme, which spreads the initial centroids apart). Different runs may still give slightly different results, so setting random_state makes the result reproducible.
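Reproducibility is easy to check: two fits with the same random_state start from the same initial centroids and converge to identical results. A minimal sketch (the small sample array and n_init=10 are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[15, 39], [15, 81], [16, 6], [16, 77], [17, 40],
                 [18, 94], [19, 3], [20, 99], [21, 15], [24, 73]], dtype=float)

# Same random_state => same initialization => identical final centroids
a = KMeans(n_clusters=3, random_state=0, n_init=10).fit(data)
b = KMeans(n_clusters=3, random_state=0, n_init=10).fit(data)
print(np.allclose(a.cluster_centers_, b.cluster_centers_))  # True
```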
What is the Elbow Method?
The Elbow Method helps us decide the right value of k (the number of clusters). It plots the Within-Cluster Sum of Squares (WCSS) for different values of k; the point where the curve bends like an elbow is the optimal k.
wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=0)
    km.fit(data)
    wcss.append(km.inertia_)
plt.plot(range(1, 10), wcss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method for Optimal k')
plt.grid(True)
plt.show()
A line chart is plotted. The 'elbow' point shows the optimal number of clusters.
🧠 More Intuition Questions
What does WCSS measure?
➤ WCSS is the sum of squared distances from each point to its cluster’s centroid. Lower WCSS means tighter clusters.
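This definition can be verified directly: computing the squared distances by hand matches scikit-learn's inertia_ attribute, which is exactly the WCSS of the fitted model. A small sketch (the sample array and n_init=10 are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[15, 39], [15, 81], [16, 6], [16, 77],
                 [18, 94], [19, 3], [20, 99], [24, 73]], dtype=float)

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(data)

# WCSS by hand: squared distance from each point to its own cluster's centroid
wcss = sum(np.sum((data[km.labels_ == j] - c) ** 2)
           for j, c in enumerate(km.cluster_centers_))
print(np.isclose(wcss, km.inertia_))  # True
```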
Why does the elbow method work?
➤ Initially, adding more clusters improves the model a lot. But after a point, the gain is small. The elbow shows the point of diminishing returns.
Summary
- K-Means is an unsupervised algorithm that groups similar data points into k clusters.
- We must choose k manually, and the Elbow Method can help in selecting the best value.
- Scikit-learn's KMeans makes implementation easy.
Practice Task for You
Use the Iris dataset from sklearn and apply K-Means clustering. Try different values of k and visualize the clusters.
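To get you started, here is a minimal loading-and-fitting skeleton; the choice of k=3 is just a first guess (Iris happens to have three species), and the plotting and elbow analysis are left to you:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data  # 150 samples, 4 features
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
print(km.labels_[:10])  # cluster assignment for the first 10 flowers
```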