Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Classification Using Logistic Regression in Spark MLlib



Introduction to Logistic Regression

Logistic Regression is a statistical method used for binary classification problems. Despite the name "regression", it is actually used to predict categorical outcomes, such as yes/no, true/false, or spam/not spam.

It works by estimating probabilities using a logistic (sigmoid) function, which maps any real-valued number into a range between 0 and 1.

When to Use Logistic Regression?

Logistic Regression is best used when your target variable is binary (i.e., it has only two possible outcomes).

How It Works

The logistic regression model calculates a weighted sum of the input features and passes it through a sigmoid function:


sigmoid(z) = 1 / (1 + e^(-z))
  

This sigmoid output is interpreted as the probability that the input belongs to class 1. If the probability is greater than 0.5, we predict class 1; otherwise, class 0.
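The calculation described above can be sketched in plain Python. This is a minimal illustration, not Spark code: the weights, bias, and feature values are made-up numbers rather than values learned by any model, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_class(probability, threshold=0.5):
    """Class 1 if the probability is greater than the threshold, otherwise class 0."""
    return 1 if probability > threshold else 0

# Hypothetical weights and bias, chosen only for illustration
weights = [0.8, -0.5]
bias = 0.1

p = predict_proba([2.0, 1.0], weights, bias)  # weighted sum z = 1.6 - 0.5 + 0.1 = 1.2
print(round(p, 4), predict_class(p))
```

A weighted sum of exactly 0 gives a probability of exactly 0.5, which this sketch assigns to class 0 because the comparison is strict.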

Real-World Example: Predicting Customer Churn

Suppose a telecom company wants to predict whether a customer will leave (churn) based on usage data. The outcome has only two values, churn or no churn, so this is a binary classification problem.

Question:

Why not use Linear Regression for this task?

Answer:

Because Linear Regression can predict values below 0 or above 1, which cannot be read as probabilities. Logistic Regression passes the same kind of weighted sum through a sigmoid function, so its output always lies between 0 and 1.
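To make the difference concrete, the sketch below evaluates one weighted sum with and without the sigmoid. The single feature (a usage figure), the weight, and the bias are all hypothetical, picked only so that the raw linear output leaves the 0 to 1 range.

```python
import math

# Hypothetical one-feature model: score = weight * usage + bias
weight, bias = 0.4, -1.0

def linear_output(usage):
    """Raw weighted sum, as a linear regression model would return it."""
    return weight * usage + bias

def logistic_output(usage):
    """The same weighted sum squeezed into (0, 1) by the sigmoid."""
    return 1.0 / (1.0 + math.exp(-linear_output(usage)))

# Print usage, raw linear output, and sigmoid output side by side
for usage in [0.0, 2.5, 10.0]:
    print(usage, round(linear_output(usage), 2), round(logistic_output(usage), 4))
```

For the smallest and largest inputs the linear output is negative or greater than 1, while the logistic output stays strictly between 0 and 1 for every input.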

Logistic Regression with PySpark MLlib

Apache Spark provides a high-level API for machine learning through MLlib. We’ll use LogisticRegression from pyspark.ml.classification.

Step-by-Step Implementation


from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create Spark session
spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()

# Sample training data
data = [
    (0.0, 1.0, 3.0, 0.0),
    (1.0, 2.0, 1.0, 1.0),
    (0.0, 2.0, 2.0, 0.0),
    (1.0, 3.0, 1.0, 1.0),
    (0.0, 1.0, 4.0, 0.0)
]
columns = ["feature1", "feature2", "feature3", "label"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Combine features into a single vector column
vec = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df = vec.transform(df)

# Initialize and train logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)

# Make predictions
predictions = model.transform(df)
predictions.select("features", "label", "prediction", "probability").show()
  
+-------------+-----+----------+--------------------+
|     features|label|prediction|         probability|
+-------------+-----+----------+--------------------+
|[0.0,1.0,3.0]|  0.0|       0.0|     [0.7296,0.2704]|
|[1.0,2.0,1.0]|  1.0|       1.0|     [0.4109,0.5891]|
|[0.0,2.0,2.0]|  0.0|       0.0|     [0.6423,0.3577]|
|[1.0,3.0,1.0]|  1.0|       1.0|     [0.3512,0.6488]|
|[0.0,1.0,4.0]|  0.0|       0.0|     [0.7623,0.2377]|
+-------------+-----+----------+--------------------+
  

Understanding the Output
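Each row pairs the assembled feature vector and the true label with the model's prediction. For binary logistic regression in Spark MLlib, the probability column is a vector of the form [P(class 0), P(class 1)], and with the default threshold of 0.5 the prediction is the class with the larger probability. The plain-Python sketch below re-derives the prediction column from the probability values shown in the table above; the helper name predicted_class is hypothetical and the code does not call Spark.

```python
# (label, [P(class 0), P(class 1)]) copied from the output table above
rows = [
    (0.0, [0.7296, 0.2704]),
    (1.0, [0.4109, 0.5891]),
    (0.0, [0.6423, 0.3577]),
    (1.0, [0.3512, 0.6488]),
    (0.0, [0.7623, 0.2377]),
]

def predicted_class(probability, threshold=0.5):
    """Return 1.0 when P(class 1) exceeds the threshold, otherwise 0.0."""
    return 1.0 if probability[1] > threshold else 0.0

# Print the label, the derived prediction, and the sum of the two probabilities
for label, probability in rows:
    print(label, predicted_class(probability), round(sum(probability), 4))
```

In every row the two probabilities sum to 1, and the derived prediction matches both the prediction column and the label.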

Evaluating the Model


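# BinaryClassificationEvaluator reports area under the ROC curve by default, not accuracy
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
# Note: this scores the model on the same rows it was trained on
auc = evaluator.evaluate(predictions)
print("Area under ROC:", auc)
  
Area under ROC: 0.71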
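The default metric of BinaryClassificationEvaluator is area under the ROC curve (areaUnderROC), which measures how well the model ranks positive examples above negative ones. It is not the same quantity as accuracy. The sketch below contrasts the two in plain Python using the pairwise definition of AUC, a standard equivalent formulation rather than Spark's internal implementation. The labels and scores are invented for illustration and are not output from the model above.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical labels and class-1 probabilities, invented for illustration
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2]

print(round(area_under_roc(labels, scores), 4))  # 5 of 6 pairs ranked correctly
print(accuracy(labels, scores))                  # 3 of 5 predictions correct
```

The same scores give different values for the two metrics, so a number returned by the evaluator should be reported under the name of the metric that produced it.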
  

Summary

Logistic Regression is a simple yet powerful algorithm for binary classification. With PySpark and MLlib, the same few lines of code can train the model on datasets too large for a single machine by distributing the work across a cluster.

For beginners, it is a good first algorithm for learning the basic machine learning workflow: feature preparation, training, prediction, and evaluation.


