Apache Spark Course
Module 12: Project – Real-World Data Pipeline

Classification Using Logistic Regression in Spark MLlib

Introduction to Logistic Regression

Logistic Regression is a statistical method used for binary classification problems. Despite the name "regression", it is actually used to predict categorical outcomes, such as yes/no, true/false, or spam/not spam.

It works by estimating probabilities with the logistic (sigmoid) function, which maps any real number to a value strictly between 0 and 1.

When to Use Logistic Regression?

Logistic Regression is best suited to problems where the target variable is binary (i.e., it has only two possible outcomes), for example:

  • Predicting whether a customer will buy a product (Yes or No)
  • Classifying an email as spam or not spam
  • Determining if a patient has a disease (Positive or Negative)

How It Works

The logistic regression model calculates a weighted sum of the input features and passes it through a sigmoid function:

sigmoid(z) = 1 / (1 + e^(-z))
  

This sigmoid output is interpreted as the probability that the input belongs to class 1. If the probability is greater than 0.5, we predict class 1; otherwise, class 0.
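
To make the calculation concrete, here is a minimal pure-Python sketch of the weighted sum, the sigmoid, and the 0.5 cut-off. The weights, bias, and feature values are made up for illustration and do not come from a trained model; `predict_proba` and `predict_class` are hypothetical helper names, not MLlib functions.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted sum plus bias."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_class(probability, threshold=0.5):
    """Class 1 if the probability exceeds the threshold, otherwise class 0."""
    return 1 if probability > threshold else 0

# Hypothetical weights and bias, chosen only for illustration
weights = [2.0, 0.5, -1.0]
bias = -0.5

p = predict_proba([1.0, 2.0, 1.0], weights, bias)
print(round(p, 4), predict_class(p))
```

Here the weighted sum is 1.5, so the probability is well above 0.5 and the predicted class is 1. A probability of exactly 0.5 falls on the class 0 side in this sketch.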

Real-World Example: Predicting Customer Churn

Suppose we have a telecom company trying to predict whether a customer will leave (churn) based on usage data. This is a binary classification problem.

Question:

Why not use Linear Regression for this task?

Answer:

Because Linear Regression can predict values below 0 or above 1, while we need a probability between 0 and 1. Logistic Regression restricts its output to that range by applying a sigmoid function.
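
A small numeric sketch illustrates the problem. It fits an ordinary least-squares line to made-up one-feature churn data (all numbers are hypothetical) and shows that the line's prediction leaves the 0 to 1 range for extreme inputs, while a sigmoid applied to the same value stays inside it. Passing the line's output through a sigmoid here only demonstrates the squashing effect; it is not how a logistic model is trained.

```python
import math

# Hypothetical data: monthly support calls (x) and churn label (y)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

# Closed-form least-squares fit of y = slope * x + intercept
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

def linear_prediction(x):
    """Prediction of the fitted straight line at x."""
    return slope * x + intercept

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

x_new = 10.0
print("linear at x=10:", linear_prediction(x_new))   # greater than 1
print("linear at x=0:", linear_prediction(0.0))      # less than 0
print("sigmoid of linear at x=10:", sigmoid(linear_prediction(x_new)))
```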

Logistic Regression with PySpark MLlib

Apache Spark provides a high-level API for machine learning through MLlib. We’ll use LogisticRegression from pyspark.ml.classification.

Step-by-Step Implementation

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create Spark session
spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()

# Sample training data
data = [
    (0.0, 1.0, 3.0, 0.0),
    (1.0, 2.0, 1.0, 1.0),
    (0.0, 2.0, 2.0, 0.0),
    (1.0, 3.0, 1.0, 1.0),
    (0.0, 1.0, 4.0, 0.0)
]
columns = ["feature1", "feature2", "feature3", "label"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Combine features into a single vector column
vec = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df = vec.transform(df)

# Initialize and train logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)

# Make predictions
predictions = model.transform(df)
predictions.select("features", "label", "prediction", "probability").show()
  
+-------------+-----+----------+--------------------+
|     features|label|prediction|         probability|
+-------------+-----+----------+--------------------+
|[0.0,1.0,3.0]|  0.0|       0.0|     [0.7296,0.2704]|
|[1.0,2.0,1.0]|  1.0|       1.0|     [0.4109,0.5891]|
|[0.0,2.0,2.0]|  0.0|       0.0|     [0.6423,0.3577]|
|[1.0,3.0,1.0]|  1.0|       1.0|     [0.3512,0.6488]|
|[0.0,1.0,4.0]|  0.0|       0.0|     [0.7623,0.2377]|
+-------------+-----+----------+--------------------+
  

Understanding the Output

  • features: The input data after being combined into a single vector
  • label: The actual class (0 or 1)
  • prediction: The class predicted by the model
  • probability: The estimated probability of each class, in the order [P(class 0), P(class 1)]
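
The relationship between the probability and prediction columns can be checked by hand. The sketch below copies the rows from the output table above into plain Python lists (no Spark required) and confirms that, with a 0.5 threshold, the predicted class is 1 exactly when the second probability exceeds 0.5. `predicted_class` is a hypothetical helper, not part of MLlib.

```python
# (label, prediction, [P(class 0), P(class 1)]) copied from the table above
rows = [
    (0.0, 0.0, [0.7296, 0.2704]),
    (1.0, 1.0, [0.4109, 0.5891]),
    (0.0, 0.0, [0.6423, 0.3577]),
    (1.0, 1.0, [0.3512, 0.6488]),
    (0.0, 0.0, [0.7623, 0.2377]),
]

def predicted_class(probability, threshold=0.5):
    """Return 1.0 when P(class 1) exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0

for label, prediction, probability in rows:
    # The two probabilities in each row sum to 1
    assert abs(sum(probability) - 1.0) < 1e-9
    # The prediction column matches the thresholded probability
    assert predicted_class(probability) == prediction

correct = sum(1 for label, prediction, _ in rows if label == prediction)
print(f"{correct} of {len(rows)} rows predicted correctly")
```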

Evaluating the Model

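# The default metric is area under the ROC curve (areaUnderROC), not accuracy
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
# Evaluated on the training data here, so the score is optimistic
auc = evaluator.evaluate(predictions)
print("Area under ROC:", auc)

Area under ROC: 1.0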
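
Area under the ROC curve, the default metric of BinaryClassificationEvaluator, measures how well the scores rank positive rows above negative rows; accuracy measures how often the thresholded prediction matches the label. The pure-Python sketch below computes both by hand, first for the five rows of the output table above and then for a second, hypothetical set of scores where the two metrics disagree. `pairwise_auc` and `accuracy` are illustrative helpers; the pairwise formula is a standard equivalent definition of AUC, not a description of how Spark computes it internally.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1.0]
    negatives = [s for y, s in zip(labels, scores) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the label."""
    matches = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return matches / len(labels)

def threshold_scores(scores, threshold=0.5):
    """Turn P(class 1) scores into 0.0/1.0 predictions."""
    return [1.0 if s > threshold else 0.0 for s in scores]

# Labels and P(class 1) copied from the output table above
table_labels = [0.0, 1.0, 0.0, 1.0, 0.0]
table_scores = [0.2704, 0.5891, 0.3577, 0.6488, 0.2377]
table_auc = pairwise_auc(table_labels, table_scores)
table_acc = accuracy(table_labels, threshold_scores(table_scores))
print("table:", table_auc, table_acc)

# Hypothetical scores: perfectly ranked, but every score is above 0.5
other_labels = [0.0, 0.0, 0.0, 1.0, 1.0]
other_scores = [0.6, 0.7, 0.8, 0.9, 0.95]
other_auc = pairwise_auc(other_labels, other_scores)
other_acc = accuracy(other_labels, threshold_scores(other_scores))
print("hypothetical:", other_auc, other_acc)
```

For the table rows both metrics are 1.0. For the hypothetical scores the ranking is still perfect (AUC 1.0), but every row is predicted as class 1, so accuracy drops to 0.4.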
  

Summary

Logistic Regression is a simple yet powerful algorithm for binary classification. In Spark, using MLlib with PySpark allows you to scale this to huge datasets across distributed systems.

For beginners, it is a great first algorithm for learning the basics of a machine learning workflow: feature preparation, training, prediction, and evaluation.