Building Machine Learning Pipelines in Apache Spark
Machine learning pipelines in Spark allow you to streamline the process of building and deploying models. Pipelines simplify repetitive tasks like feature transformation, model training, and prediction. This is especially useful when you're working with large datasets.
What is a Machine Learning Pipeline?
A machine learning pipeline is a chain of data processing stages: each stage takes in data, performs a transformation, and passes the result to the next stage. In Spark, this is implemented with the Pipeline class from the pyspark.ml module.
Typical pipeline stages include:
- Data Cleaning – Handle nulls, filter out invalid rows
- Feature Extraction – Convert raw data into numerical features
- Feature Transformation – Scale or normalize features
- Model Training – Fit an algorithm to the data
- Prediction – Use the trained model to make predictions on new data
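In code, these stages map directly onto the stages list of a Pipeline. Here is a runnable toy sketch of that shape; the DataFrame and its columns are invented purely for illustration:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("PipelineSkeleton").getOrCreate()
# Tiny made-up dataset standing in for real data
df = spark.createDataFrame(
    [("a", 1.0, 0), ("b", 2.0, 1), ("a", 3.0, 1), ("b", 0.5, 0)],
    ["category", "value", "label"],
)
# Each stage consumes the previous stage's output DataFrame
pipeline = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="categoryIndex"),                # feature extraction
    VectorAssembler(inputCols=["categoryIndex", "value"], outputCol="features"),  # feature transformation
    LogisticRegression(featuresCol="features", labelCol="label"),                 # model training
])
model = pipeline.fit(df)  # fit() runs every stage in order
model.transform(df).select("label", "prediction").show()  # prediction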
Real-Life Analogy
Imagine you're in a car factory. Each station (painting, assembling, testing) performs one step and passes the car to the next. Similarly, in ML pipelines, your data moves through different processing stations — until you get predictions at the end.
Why Use Pipelines?
- You avoid writing repetitive transformation code
- You can cross-validate and tune the entire flow
- Your workflow becomes cleaner, modular, and production-ready
Question:
Can we manually apply transformations and training separately?
Answer:
Yes, but pipelines are designed to combine these steps into one consistent and reusable object, which is easier to maintain and deploy.
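For example, once a pipeline has been fit (as in the example below), the whole chain can be saved and reloaded as a single object; the path here is just a placeholder:
# Persist the entire fitted pipeline (all stages) to one directory
model.write().overwrite().save("/tmp/income_pipeline")
# Later, e.g. in a serving job, reload it and score new data in one call
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/tmp/income_pipeline")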
Example: Building a Classification Pipeline in PySpark
Let’s build a pipeline that classifies whether someone earns more than $70K per year, using a small sample people dataset (the code below assumes the CSV provides gender, age, and salary columns).
Step-by-Step Pipeline
- Read the data
- Index categorical columns
- Assemble all features
- Train a Logistic Regression model
- Make predictions
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Start Spark session
spark = SparkSession.builder.appName("MLPipelineExample").getOrCreate()
# Sample data: Spark cannot read directly from an http(s) URL,
# so fetch the file first with SparkFiles
# (the rest of the example assumes this CSV provides gender, age, and salary columns)
from pyspark import SparkFiles
url = "https://raw.githubusercontent.com/datablist/sample-csv-files/main/files/people/people-100.csv"
spark.sparkContext.addFile(url)
data = spark.read.csv("file://" + SparkFiles.get("people-100.csv"), header=True, inferSchema=True)
# Let's simulate a binary label for income (for demo purposes)
from pyspark.sql.functions import when
data = data.withColumn("income_label", when(data["salary"] > 70000, 1).otherwise(0))
# Drop nulls
data = data.dropna(subset=["salary", "gender", "age"])
# Encode 'gender'
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
# Assemble features into a single vector column
# (salary is kept as a feature only for demo purposes; since the label is
# derived from it, a real model should exclude it to avoid label leakage)
assembler = VectorAssembler(
    inputCols=["genderIndex", "age", "salary"],
    outputCol="features"
)
# Logistic Regression
lr = LogisticRegression(featuresCol="features", labelCol="income_label")
# Create pipeline
pipeline = Pipeline(stages=[gender_indexer, assembler, lr])
# Fit pipeline
model = pipeline.fit(data)
# Predict
predictions = model.transform(data)
# Show results
predictions.select("gender", "age", "salary", "prediction", "income_label").show(5)
+------+---+------+----------+------------+
|gender|age|salary|prediction|income_label|
+------+---+------+----------+------------+
|  Male| 30| 72000|       1.0|           1|
|Female| 24| 65000|       0.0|           0|
|  Male| 45| 88000|       1.0|           1|
|Female| 34| 54000|       0.0|           0|
|  Male| 29| 76000|       1.0|           1|
+------+---+------+----------+------------+
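The imports above bring in BinaryClassificationEvaluator, but the script never uses it. Here is a short continuation that scores the predictions (areaUnderROC is the evaluator's default metric):
# Evaluate the model; LogisticRegression also emits a rawPrediction column,
# which is the evaluator's default input
evaluator = BinaryClassificationEvaluator(labelCol="income_label")
auc = evaluator.evaluate(predictions)
print(f"Area under ROC: {auc:.3f}")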
Understanding the Code
- StringIndexer converts categorical columns like "gender" into numerical form.
- VectorAssembler merges all the feature columns into the single vector column required by ML models.
- Pipeline chains these steps together.
- LogisticRegression is our classification algorithm.
- model.transform() runs the full pipeline on input data to produce predictions.
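Because the pipeline is itself a single estimator, you can cross-validate and tune the entire flow at once, as promised earlier. A minimal sketch that reuses the pipeline, lr, and data objects from the example above; the grid values are arbitrary:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Search over two regularization strengths for the logistic regression stage
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1])
              .build())
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="income_label"),
                    numFolds=3)
cv_model = cv.fit(data)  # refits the whole pipeline for every fold and grid point
best_model = cv_model.bestModel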
Question:
Why can’t we pass categorical data like 'Male' or 'Female' directly to ML models?
Answer:
Machine learning models require numerical inputs. StringIndexer assigns each category a numeric index (e.g., Male = 1, Female = 0; by default the most frequent category receives index 0), allowing the model to process them.
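A quick way to inspect the mapping, reusing the data DataFrame from the example above (because StringIndexer orders categories by frequency by default, the exact numbers depend on your data):
# Fit the indexer alone and look at the distinct category-to-index pairs
indexed = StringIndexer(inputCol="gender", outputCol="genderIndex").fit(data).transform(data)
indexed.select("gender", "genderIndex").distinct().show()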
Summary
Machine learning pipelines in Spark are essential for building clean, maintainable, and scalable ML workflows. By understanding and using pipelines, beginners can build complete ML solutions without getting overwhelmed by manual data transformations.