Machine learning pipelines in Spark allow you to streamline the process of building and deploying models. Pipelines simplify repetitive tasks like feature transformation, model training, and prediction. This is especially useful when you're working with large datasets.
A machine learning pipeline is a chain of data processing stages where each stage takes in data, performs a transformation, and passes it to the next stage. In Spark, this is implemented using the Pipeline class from the pyspark.ml module.
Typical pipeline stages include:
- Encoding categorical features into numbers (e.g., StringIndexer)
- Assembling feature columns into a single vector (VectorAssembler)
- Training a model with an estimator such as LogisticRegression
Imagine you're in a car factory. Each station (painting, assembling, testing) performs one step and passes the car to the next. Similarly, in an ML pipeline, your data moves through different processing stations until predictions come out at the end.
Can we manually apply transformations and training separately?
Yes, but pipelines are designed to combine these steps into one consistent and reusable object, which is easier to maintain and deploy.
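For comparison, here is a minimal sketch of the manual approach, using a tiny made-up DataFrame (the column names and values are illustrative, not from the article's dataset). Each stage has to be fitted and applied by hand, in the right order:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("ManualStepsDemo").getOrCreate()
# Tiny illustrative dataset, just to show the wiring
df = spark.createDataFrame(
    [("Male", 30, 72000, 1), ("Female", 24, 65000, 0),
     ("Male", 45, 88000, 1), ("Female", 34, 54000, 0)],
    ["gender", "age", "salary", "income_label"]
)
# Fit and apply each stage by hand
indexed = StringIndexer(inputCol="gender", outputCol="genderIndex").fit(df).transform(df)
assembled = VectorAssembler(
    inputCols=["genderIndex", "age", "salary"], outputCol="features"
).transform(indexed)
lr_model = LogisticRegression(featuresCol="features", labelCol="income_label").fit(assembled)
predictions = lr_model.transform(assembled)
A Pipeline folds all of this bookkeeping into a single fit() call, and the fitted stages travel together as one model.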
Let's build a pipeline that classifies whether someone earns more than $70K per year, using a small sample people dataset (with gender, age, and salary columns) and a label we simulate for demo purposes.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Start Spark session
spark = SparkSession.builder.appName("MLPipelineExample").getOrCreate()
# Sample data: download the CSV locally first (e.g. with wget or curl), since
# Spark cannot read https:// URLs directly:
# https://raw.githubusercontent.com/datablist/sample-csv-files/main/files/people/people-100.csv
# The file is assumed to contain gender, age, and salary columns.
data = spark.read.csv("people-100.csv", header=True, inferSchema=True)
# Let's simulate a binary label for income (for demo purposes)
from pyspark.sql.functions import when
data = data.withColumn("income_label", when(data["salary"] > 70000, 1).otherwise(0))
# Drop nulls
data = data.dropna(subset=["salary", "gender", "age"])
# Encode 'gender'
gender_indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
# Assemble features (note: income_label was derived from salary, so using
# salary as a feature leaks the label; fine for a wiring demo, not a real model)
assembler = VectorAssembler(
inputCols=["genderIndex", "age", "salary"],
outputCol="features"
)
# Logistic Regression
lr = LogisticRegression(featuresCol="features", labelCol="income_label")
# Create pipeline
pipeline = Pipeline(stages=[gender_indexer, assembler, lr])
# Fit pipeline
model = pipeline.fit(data)
# Predict
predictions = model.transform(data)
# Show results
predictions.select("gender", "age", "salary", "prediction", "income_label").show(5)
+------+---+------+----------+------------+
|gender|age|salary|prediction|income_label|
+------+---+------+----------+------------+
|  Male| 30| 72000|       1.0|           1|
|Female| 24| 65000|       0.0|           0|
|  Male| 45| 88000|       1.0|           1|
|Female| 34| 54000|       0.0|           0|
|  Male| 29| 76000|       1.0|           1|
+------+---+------+----------+------------+
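The BinaryClassificationEvaluator imported at the top can now score these predictions; area under the ROC curve is its default metric. Note that this demo fits and predicts on the same data; in a real workflow you would split first, e.g. train, test = data.randomSplit([0.8, 0.2]), and evaluate on the held-out portion.
# Evaluate the predictions produced above (AUC is the default metric)
evaluator = BinaryClassificationEvaluator(labelCol="income_label")
auc = evaluator.evaluate(predictions)
print(f"Area under ROC: {auc:.3f}")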
- StringIndexer converts categorical columns like "gender" into numerical form.
- VectorAssembler merges all feature columns into the single vector column ML models require.
- Pipeline chains these steps together.
- LogisticRegression is used as our classification algorithm.
- model.transform() runs the full fitted pipeline on input data to produce predictions.
Why can't we pass categorical data like 'Male' or 'Female' directly to ML models?
Machine learning models require numerical inputs. StringIndexer maps each category to a numeric index (by default, the most frequent label gets 0.0, the next 1.0, and so on), allowing the model to process them.
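You can inspect the exact mapping on a fitted indexer. A minimal sketch, using a throwaway DataFrame (values illustrative):
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer
spark = SparkSession.builder.appName("IndexerDemo").getOrCreate()
df = spark.createDataFrame([("Male",), ("Female",), ("Female",)], ["gender"])
# The fitted model exposes the learned label order
indexer_model = StringIndexer(inputCol="gender", outputCol="genderIndex").fit(df)
print(indexer_model.labels)  # ['Female', 'Male'] -> Female = 0.0, Male = 1.0
indexer_model.transform(df).show()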
Machine learning pipelines in Spark are essential for building clean, maintainable, and scalable ML workflows. By understanding and using pipelines, beginners can build complete ML solutions without getting overwhelmed by manual data transformations.
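One practical note on deployment: a fitted pipeline can be saved and reloaded as a single unit, so the exact same transformations are applied at serving time. A minimal sketch, continuing from the example above (the path is illustrative):
# Persist the fitted pipeline, then reload it later as one object
model.write().overwrite().save("models/income_pipeline")
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("models/income_pipeline")
new_predictions = loaded_model.transform(data)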