Clustering is an unsupervised machine learning technique used to group similar data points together based on their features. Unlike classification, clustering doesn’t need labeled data — it tries to discover the structure hidden within the data itself.
KMeans is one of the most popular clustering algorithms. It divides the data into K clusters, where each data point belongs to the cluster with the nearest mean (also called the centroid).
How does the algorithm work?
1. Choose the number of clusters, K.
2. Pick K random points as centroids.
3. Assign each data point to its nearest centroid.
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3 and 4 until the centroids stop moving.
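To make these steps concrete, here is a minimal sketch of the loop in plain Python with NumPy (this is Lloyd's algorithm; the function kmeans_sketch and its defaults are purely illustrative — Spark MLlib implements a more robust version of all of this for us):
import numpy as np

def kmeans_sketch(points, k, iterations=10, seed=1):
    """Illustrative KMeans (Lloyd's algorithm) on an (n, d) NumPy array."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick K random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        centroids = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
    return labels, centroids
Note that the result depends on the random initialization, which is why a fixed seed is used here and in the Spark example below.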
Imagine you have data on customer behavior: how much they spend, how frequently they purchase, and what kind of products they prefer. You want to group similar customers so you can run targeted marketing campaigns.
KMeans clustering will help group these customers into natural segments based on how similarly they behave.
Why don’t we label the customers manually and classify them?
Because in real scenarios, we often don’t have labeled data. Clustering lets the algorithm find natural groupings based only on the data's structure.
Let’s implement a simple KMeans example using PySpark. We’ll cluster synthetic data points in 2D space.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
# Create Spark session
spark = SparkSession.builder.appName("KMeansClustering").getOrCreate()
# Sample 2D data (x, y)
data = [
(1.0, 1.0), (1.5, 1.8), (5.0, 8.0),
(8.0, 8.0), (1.0, 0.6), (9.0, 11.0),
(8.0, 2.0), (10.0, 2.0), (9.0, 3.0)
]
# Create DataFrame
df = spark.createDataFrame(data, ["x", "y"])
df.show()
+----+----+
|   x|   y|
+----+----+
| 1.0| 1.0|
| 1.5| 1.8|
| 5.0| 8.0|
| 8.0| 8.0|
| 1.0| 0.6|
| 9.0|11.0|
| 8.0| 2.0|
|10.0| 2.0|
| 9.0| 3.0|
+----+----+
# Convert x and y into a single features column
vec_assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
final_data = vec_assembler.transform(df)
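# Peek at the assembled column: each row now carries a single
# vector, e.g. [1.0,1.0], combining x and y
final_data.select("x", "y", "features").show(3, truncate=False)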
# Apply KMeans with 3 clusters
kmeans = KMeans(k=3, seed=1)
model = kmeans.fit(final_data)
# Make predictions
predictions = model.transform(final_data)
predictions.select("x", "y", "prediction").show()
+----+----+----------+
|   x|   y|prediction|
+----+----+----------+
| 1.0| 1.0|         1|
| 1.5| 1.8|         1|
| 5.0| 8.0|         0|
| 8.0| 8.0|         0|
| 1.0| 0.6|         1|
| 9.0|11.0|         0|
| 8.0| 2.0|         2|
|10.0| 2.0|         2|
| 9.0| 3.0|         2|
+----+----+----------+
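The fitted model can also assign brand-new points to the learned clusters. The two points below are made up for illustration; any DataFrame with the same features column would work:
# Hypothetical new points (made up for illustration)
new_df = spark.createDataFrame([(2.0, 1.2), (9.5, 9.5)], ["x", "y"])
new_data = vec_assembler.transform(new_df)
# Each new point is assigned the id of its nearest centroid
model.transform(new_data).select("x", "y", "prediction").show()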
# Print cluster centers
centers = model.clusterCenters()
print("Cluster Centers:")
for idx, center in enumerate(centers):
    print(f"Cluster {idx}: {center}")
Cluster Centers:
Cluster 0: [7.333333333333333, 9.0]
Cluster 1: [1.1666666666666667, 1.1333333333333333]
Cluster 2: [9.0, 2.3333333333333335]
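How good are these clusters? Besides eyeballing the centroids, Spark MLlib provides a ClusteringEvaluator (available since Spark 2.3) that computes the silhouette score, which ranges from -1 to 1; values close to 1 mean points sit firmly inside their clusters:
from pyspark.ml.evaluation import ClusteringEvaluator

# Silhouette close to 1 indicates tight, well-separated clusters
evaluator = ClusteringEvaluator(featuresCol="features", predictionCol="prediction")
silhouette = evaluator.evaluate(predictions)
print(f"Silhouette score: {silhouette}")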
Why did we choose K=3 clusters?
We looked at the data and assumed it would naturally fall into 3 groups. In real scenarios, you can use the "elbow method" to find a good value of K: plot the cluster inertia (within-cluster sum of squares) against K and look for the "elbow" where adding more clusters stops reducing the inertia much.
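Here is one way to sketch the elbow method for our data; it assumes Spark 2.4 or later, where the model's training summary exposes trainingCost (the within-cluster sum of squares):
# Fit KMeans for several values of K and record the inertia
for k in range(2, 7):
    model_k = KMeans(k=k, seed=1).fit(final_data)
    print(f"K={k}  cost={model_k.summary.trainingCost}")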
KMeans is a simple yet powerful clustering algorithm, and Spark MLlib makes it easy to apply even to large-scale data. You now know how to cluster similar data points and how to interpret cluster assignments and centroids using PySpark.