Clustering is an unsupervised machine learning technique used to group similar data points together based on their features. Unlike classification, clustering doesn’t need labeled data — it tries to discover the structure hidden within the data itself.
KMeans is one of the most popular clustering algorithms. It divides the data into K clusters, where each data point belongs to the cluster with the nearest mean (also called the centroid).
How does the algorithm work?
1. Choose the number of clusters, K.
2. Pick K random points as centroids.
3. Assign each data point to its nearest centroid.
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3 and 4 until the centroids stop moving.
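To make these steps concrete, here is a minimal sketch of the loop in plain Python with NumPy (this is Lloyd's algorithm; the function kmeans_sketch and its defaults are purely illustrative — Spark MLlib implements a more robust version of all of this for us):
import numpy as np

def kmeans_sketch(points, k, iterations=10, seed=1):
    """Illustrative KMeans (Lloyd's algorithm) on an (n, d) NumPy array."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick K random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
    return labels, centroids
Note that the result depends on the random initialization, which is why a fixed seed is used here and in the Spark example below.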
Imagine you have data on customer behavior: how much they spend, how frequently they purchase, and what kind of products they prefer. You want to group similar customers so you can run targeted marketing campaigns.
KMeans clustering will help group these customers into natural segments based on how similarly they behave.
Why don’t we label the customers manually and classify them?
Because in real scenarios, we often don’t have labeled data. Clustering lets the algorithm find natural groupings based only on the data's structure.
Let’s implement a simple KMeans example using PySpark. We’ll cluster synthetic data points in 2D space.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
# Create Spark session
spark = SparkSession.builder.appName("KMeansClustering").getOrCreate()
# Sample 2D data (x, y)
data = [
(1.0, 1.0), (1.5, 1.8), (5.0, 8.0),
(8.0, 8.0), (1.0, 0.6), (9.0, 11.0),
(8.0, 2.0), (10.0, 2.0), (9.0, 3.0)
]
# Create DataFrame
df = spark.createDataFrame(data, ["x", "y"])
df.show()
+----+----+
|   x|   y|
+----+----+
| 1.0| 1.0|
| 1.5| 1.8|
| 5.0| 8.0|
| 8.0| 8.0|
| 1.0| 0.6|
| 9.0|11.0|
| 8.0| 2.0|
|10.0| 2.0|
| 9.0| 3.0|
+----+----+
# Convert x and y into a single features column
vec_assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
final_data = vec_assembler.transform(df)
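# Peek at the assembled column: each row now carries a single
# vector, e.g. [1.0,1.0], combining x and y
final_data.select("x", "y", "features").show(3, truncate=False)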
# Apply KMeans with 3 clusters
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(final_data)
# Make predictions
predictions = model.transform(final_data)
predictions.select("x", "y", "prediction").show()
+----+----+----------+
|   x|   y|prediction|
+----+----+----------+
| 1.0| 1.0|         1|
| 1.5| 1.8|         1|
| 5.0| 8.0|         0|
| 8.0| 8.0|         0|
| 1.0| 0.6|         1|
| 9.0|11.0|         0|
| 8.0| 2.0|         2|
|10.0| 2.0|         2|
| 9.0| 3.0|         2|
+----+----+----------+
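The fitted model can also assign brand-new points to the learned clusters. The two points below are made up for illustration; any DataFrame with the same features column would work:
# Hypothetical new points (made up for illustration)
new_df = spark.createDataFrame([(2.0, 1.2), (9.5, 9.5)], ["x", "y"])
new_data = vec_assembler.transform(new_df)
# Each new point is assigned the id of its nearest centroid
model.transform(new_data).select("x", "y", "prediction").show()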
# Print cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for idx, center in enumerate(centers):
    print(f"Cluster {idx}: {center}")
Cluster Centers:
Cluster 0: [7.333333333333333, 9.0]
Cluster 1: [1.1666666666666667, 1.1333333333333333]
Cluster 2: [9.0, 2.3333333333333335]
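How good are these clusters? Besides eyeballing the centroids, Spark MLlib provides a ClusteringEvaluator (available since Spark 2.3) that computes the silhouette score, which ranges from -1 to 1; values close to 1 mean points sit firmly inside their clusters:
from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette close to 1 indicates tight, well-separated clusters
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction")
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette score: {silhouette}")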
Why did we choose K=3 clusters?
We looked at the data and assumed it would naturally fall into 3 groups. In real scenarios, you can use the "elbow method" to find a good value of K: plot the cluster inertia (within-cluster sum of squares) against K and look for the "elbow" where adding more clusters stops reducing the inertia much.
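Here is one way to sketch the elbow method for our data; it assumes Spark 2.4 or later, where the model's training summary exposes trainingCost (the within-cluster sum of squares):
# Fit KMeans for several values of K and record the inertia
for k in range(2, 7):
    model_k = KMeans(k=k, seed=1).fit(final_data)
    print(f"K={k}  cost={model_k.summary.trainingCost}")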
KMeans is a simple yet powerful clustering algorithm, and Spark MLlib makes it easy to apply even to large-scale data. You now know how to cluster similar data points and how to interpret cluster assignments and centroids using PySpark.