Apache Spark CourseApache Spark Course1

Module 12: Project – Real-World Data PipelineModule 12: Project – Real-World Data Pipeline1

Building Machine Learning Pipelines in Apache Spark



Building Machine Learning Pipelines in Apache Spark

Machine learning pipelines in Spark allow you to streamline the process of building and deploying models. Pipelines simplify repetitive tasks like feature transformation, model training, and prediction. This is especially useful when you're working with large datasets.

What is a Machine Learning Pipeline?

A machine learning pipeline is a chain of data processing stages where each stage takes in data, performs a transformation, and passes it to the next stage. In Spark, this is implemented using the Pipeline class from the pyspark.ml module.

Typical pipeline stages include:

Real-Life Analogy

Imagine you're in a car factory. Each station (painting, assembling, testing) performs one step and passes the car to the next. Similarly, in ML pipelines, your data moves through different processing stations — until you get predictions at the end.

Why Use Pipelines?

Question:

Can we manually apply transformations and training separately?

Answer:

Yes, but pipelines are designed to combine these steps into one consistent and reusable object, which is easier to maintain and deploy.

Example: Building a Classification Pipeline in PySpark

Let’s build a pipeline that classifies whether someone earns more than $50K per year based on the UCI Adult Census dataset.

Step-by-Step Pipeline

  1. Read the data
  2. Index categorical columns
  3. Assemble all features
  4. Train a Logistic Regression model
  5. Make predictions

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Start Spark session
spark = SparkSession.builder.appName("MLPipelineExample").getOrCreate()

# Sample data
data = spark.read.csv("https://raw.githubusercontent.com/datablist/sample-csv-files/main/files/people/people-100.csv", header=True, inferSchema=True)

# Let's simulate a binary label for income (for demo purposes)
from pyspark.sql.functions import when
data = data.withColumn("income_label", when(data["salary"] > 70000, 1).otherwise(0))

# Drop nulls
data = data.dropna(subset=["salary", "gender", "age"])

# Encode 'gender'
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")

# Assemble features
assembler = VectorAssembler(
    inputCols=["genderIndex", "age", "salary"],
    outputCol="features"
)

# Logistic Regression
lr = LogisticRegression(featuresCol="features", labelCol="income_label")

# Create pipeline
pipeline = Pipeline(stages=[gender_indexer, assembler, lr])

# Fit pipeline
model = pipeline.fit(data)

# Predict
predictions = model.transform(data)

# Show results
predictions.select("gender", "age", "salary", "prediction", "income_label").show(5)
    
+------+---+------+----------+------------+
|gender|age|salary|prediction|income_label|
+------+---+------+----------+------------+
|  Male| 30| 72000|       1.0|           1|
|Female| 24| 65000|       0.0|           0|
|  Male| 45| 88000|       1.0|           1|
|Female| 34| 54000|       0.0|           0|
|  Male| 29| 76000|       1.0|           1|
+------+---+------+----------+------------+
    

Understanding the Code

Question:

Why can’t we pass categorical data like 'Male' or 'Female' directly to ML models?

Answer:

Machine learning models require numerical inputs. StringIndexer transforms categories into numbers (e.g., Male = 1, Female = 0), allowing the model to process them.

Summary

Machine learning pipelines in Spark are essential for building clean, maintainable, and scalable ML workflows. By understanding and using pipelines, beginners can build complete ML solutions without getting overwhelmed by manual data transformations.



Welcome to ProgramGuru

Sign up to start your journey with us

Support ProgramGuru.org

Mention your name, and programguru.org in the message. Your name shall be displayed in the sponsers list.

PayPal

UPI

PhonePe QR

MALLIKARJUNA M