Apache Spark CourseApache Spark Course1

Module 12: Project – Real-World Data PipelineModule 12: Project – Real-World Data Pipeline1

Clustering with KMeans in Spark MLlib



What is Clustering?

Clustering is an unsupervised machine learning technique used to group similar data points together based on their features. Unlike classification, clustering doesn’t need labeled data — it tries to discover the structure hidden within the data itself.

Understanding KMeans Clustering

KMeans is the most popular clustering algorithm. It divides the data into K clusters, where each data point belongs to the cluster with the nearest mean (also called centroid).

How KMeans Works: Step-by-Step

  1. Choose the number of clusters K.
  2. Initialize K random points as centroids.
  3. Assign each point to the nearest centroid (cluster assignment).
  4. Recalculate centroids as the average of all points in a cluster.
  5. Repeat steps 3–4 until cluster assignments no longer change (convergence).

Real-Life Example: Customer Segmentation

Imagine you have data on customer behavior: how much they spend, how frequently they purchase, and what kind of products they prefer. You want to group similar customers so you can run targeted marketing campaigns.

KMeans clustering will help group customers into segments like:

Question:

Why don’t we label the customers manually and classify them?

Answer:

Because in real scenarios, we often don’t have labeled data. Clustering lets the algorithm find natural groupings based only on the data's structure.

KMeans in PySpark (Spark MLlib)

Let’s implement a simple KMeans example using PySpark. We’ll cluster synthetic data points in 2D space.

Step 1: Setup and Sample Data


from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Create Spark session
spark = SparkSession.builder.appName("KMeansClustering").getOrCreate()

# Sample 2D data (x, y)
data = [
    (1.0, 1.0), (1.5, 1.8), (5.0, 8.0),
    (8.0, 8.0), (1.0, 0.6), (9.0, 11.0),
    (8.0, 2.0), (10.0, 2.0), (9.0, 3.0)
]

# Create DataFrame
df = spark.createDataFrame(data, ["x", "y"])
df.show()
    
+----+----+
|   x|   y|
+----+----+
| 1.0| 1.0|
| 1.5| 1.8|
| 5.0| 8.0|
| 8.0| 8.0|
| 1.0| 0.6|
| 9.0|11.0|
| 8.0| 2.0|
|10.0| 2.0|
| 9.0| 3.0|
+----+----+
    

Step 2: Assemble Features


# Convert x and y into a single features column
vec_assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
final_data = vec_assembler.transform(df)
    

Step 3: Apply KMeans Algorithm


# Apply KMeans with 3 clusters
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(final_data)

# Make predictions
predictions = model.transform(final_data)
predictions.select("x", "y", "prediction").show()
    
+----+----+----------+
|   x|   y|prediction|
+----+----+----------+
| 1.0| 1.0|         1|
| 1.5| 1.8|         1|
| 5.0| 8.0|         0|
| 8.0| 8.0|         0|
| 1.0| 0.6|         1|
| 9.0|11.0|         0|
| 8.0| 2.0|         2|
|10.0| 2.0|         2|
| 9.0| 3.0|         2|
+----+----+----------+
    

Step 4: View Cluster Centers


# Print cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for idx, center in enumerate(centers):
    print(f"Cluster {idx}: {center}")
    
Cluster Centers:
Cluster 0: [7.333333333333333, 9.0]
Cluster 1: [1.1666666666666667, 1.1333333333333333]
Cluster 2: [9.0, 2.3333333333333335]
    

Intuition Check

Question:

Why did we choose K=3 clusters?

Answer:

We looked at the data and assumed it may naturally fall into 3 groups. In real scenarios, you can use the "elbow method" to find the optimal value of K by plotting cluster inertia (within-cluster sum of squares) against K.

Summary

KMeans is a simple yet powerful clustering algorithm. In Spark, it's easy to apply even on large-scale data using MLlib. You now understand how to cluster similar data points and interpret cluster assignments and centroids using PySpark.



Welcome to ProgramGuru

Sign up to start your journey with us

Support ProgramGuru.org

Mention your name, and programguru.org in the message. Your name shall be displayed in the sponsers list.

PayPal

UPI

PhonePe QR

MALLIKARJUNA M