Classification Using Logistic Regression in Spark MLlib

Introduction to Logistic Regression

Logistic Regression is a statistical method used for binary classification problems. Despite the name "regression", it is actually used to predict categorical outcomes, such as yes/no, true/false, or spam/not spam.

It works by estimating probabilities using a logistic (sigmoid) function, which maps any real-valued number into a range between 0 and 1.

When to Use Logistic Regression?

Logistic Regression is best used when your target variable is binary (i.e., it has only two possible outcomes). Typical examples:

Predicting whether a customer will buy a product (Yes or No)
Classifying an email as spam or not spam
Determining if a patient has a disease (Positive or Negative)

How It Works

The logistic regression model calculates a weighted sum of the input features and passes it through a sigmoid function:

sigmoid(z) = 1 / (1 + e^(-z))

This sigmoid output is interpreted as the probability that the input belongs to class 1. If the probability is greater than 0.5, we predict class 1; otherwise, class 0.
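
The weighted sum, the sigmoid, and the 0.5 threshold can be sketched in a few lines of plain Python. This is only an illustration of the formula above, not Spark's implementation; the weights and bias are hypothetical values chosen for the example, whereas in practice they are learned from training data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_class(features, weights, bias, threshold=0.5):
    """Predict class 1 when the probability exceeds the threshold, else class 0."""
    return 1 if predict_proba(features, weights, bias) > threshold else 0

# Hypothetical weights and bias, for illustration only
weights = [2.0, 0.5, -1.0]
bias = -0.5

# Weighted sum: 2.0*1.0 + 0.5*2.0 - 1.0*1.0 - 0.5 = 1.5
p = predict_proba([1.0, 2.0, 1.0], weights, bias)
print("P(class 1):", round(p, 4))
print("Predicted class:", predict_class([1.0, 2.0, 1.0], weights, bias))
```

A weighted sum of exactly 0 gives a probability of exactly 0.5, so the 0.5 threshold corresponds to asking whether the weighted sum is positive.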

Real-World Example: Predicting Customer Churn

Suppose we have a telecom company trying to predict whether a customer will leave (churn) based on usage data. This is a binary classification problem.

Question:

Why not use Linear Regression for this task?

Answer:

Linear Regression can predict values below 0 or above 1, whereas we need a probability between 0 and 1. Logistic Regression restricts its output to that range by passing the weighted sum through a sigmoid function.
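
To make the difference concrete, here is a small sketch comparing the two outputs for the same inputs. The slope `w` and intercept `b` are hypothetical values picked for illustration, not fitted to any churn data.

```python
import math

def linear_score(x, w=0.4, b=-0.3):
    """Hypothetical linear model: the output is unbounded."""
    return w * x + b

def logistic_score(x, w=0.4, b=-0.3):
    """Same weighted sum passed through a sigmoid: output stays in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

for x in [-5.0, 0.0, 2.0, 10.0]:
    print(f"x={x:5.1f}  linear={linear_score(x):6.2f}  logistic={logistic_score(x):.3f}")
```

For the smallest and largest inputs, the linear score falls below 0 and above 1, so it cannot be read as a probability, while the logistic score always can.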

Logistic Regression with PySpark MLlib

Apache Spark provides a high-level API for machine learning through MLlib. We’ll use LogisticRegression from pyspark.ml.classification.

Step-by-Step Implementation

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create Spark session
spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()

# Sample training data
data = [
    (0.0, 1.0, 3.0, 0.0),
    (1.0, 2.0, 1.0, 1.0),
    (0.0, 2.0, 2.0, 0.0),
    (1.0, 3.0, 1.0, 1.0),
    (0.0, 1.0, 4.0, 0.0)
]
columns = ["feature1", "feature2", "feature3", "label"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Combine features into a single vector column
vec = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df = vec.transform(df)

# Initialize and train logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)

# Make predictions
predictions = model.transform(df)
predictions.select("features", "label", "prediction", "probability").show()

+-------------+-----+----------+--------------------+
|     features|label|prediction|         probability|
+-------------+-----+----------+--------------------+
|[0.0,1.0,3.0]|  0.0|       0.0|     [0.7296,0.2704]|
|[1.0,2.0,1.0]|  1.0|       1.0|     [0.4109,0.5891]|
|[0.0,2.0,2.0]|  0.0|       0.0|     [0.6423,0.3577]|
|[1.0,3.0,1.0]|  1.0|       1.0|     [0.3512,0.6488]|
|[0.0,1.0,4.0]|  0.0|       0.0|     [0.7623,0.2377]|
+-------------+-----+----------+--------------------+

Understanding the Output

features: The input columns combined into a single vector
label: The actual class (0 or 1)
prediction: The class predicted by the model
probability: A vector [P(class 0), P(class 1)] holding the model's estimated probability for each class
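
The relationship between the probability and prediction columns can be checked by hand. The sketch below copies the probability vectors from the sample output above and applies a 0.5 threshold to the second entry. It assumes the default threshold and is a plain-Python illustration, not a call into Spark.

```python
# Probability vectors copied from the sample output: [P(class 0), P(class 1)]
probabilities = [
    [0.7296, 0.2704],
    [0.4109, 0.5891],
    [0.6423, 0.3577],
    [0.3512, 0.6488],
    [0.7623, 0.2377],
]

def to_prediction(prob, threshold=0.5):
    """Predict class 1.0 when P(class 1) exceeds the threshold, else 0.0."""
    return 1.0 if prob[1] > threshold else 0.0

predicted = [to_prediction(p) for p in probabilities]
print(predicted)

# The two entries of each vector are complementary and sum to 1
for p in probabilities:
    assert abs(sum(p) - 1.0) < 1e-9
```

The resulting list matches the prediction column row by row.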

Evaluating the Model

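# BinaryClassificationEvaluator reports area under the ROC curve by default, not accuracy
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("Area under ROC:", auc)

Area under ROC: 0.71

Note that this score is computed on the same data the model was trained on, so it is an optimistic estimate. In practice, evaluate on a held-out test set.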
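
The default metric of BinaryClassificationEvaluator is areaUnderROC, which is a different quantity from accuracy. The sketch below computes both on a small set of hypothetical labels and scores (made up for illustration, not produced by the model above). The AUC here uses the pairwise-ranking definition, which is a convenient way to reason about the metric and is not necessarily how Spark computes it internally.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted P(class 1) scores
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.7, 0.3]

print("Accuracy:", round(accuracy(labels, scores), 3))
print("Area under ROC:", round(auc_roc(labels, scores), 3))
```

Accuracy depends on the chosen threshold, while area under ROC measures how well the scores rank positives above negatives across all thresholds, so the two numbers generally differ.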

Summary

Logistic Regression is a simple yet powerful algorithm for binary classification. In Spark, using MLlib with PySpark allows you to scale this to huge datasets across distributed systems.

For beginners, it is a great first algorithm for learning the basics of a machine learning workflow: feature preparation, training, prediction, and evaluation.
