Introduction to Logistic Regression
Logistic Regression is a statistical method used for binary classification problems. Despite the name "regression", it is actually used to predict categorical outcomes, such as yes/no, true/false, or spam/not spam.
It works by estimating probabilities using a logistic (sigmoid) function, which maps any real-valued number into a range between 0 and 1.
When to Use Logistic Regression?
Logistic Regression is best suited to problems where the target variable is binary, that is, it has only two possible outcomes. Typical examples:
- Predicting whether a customer will buy a product (Yes or No)
- Classifying an email as spam or not spam
- Determining if a patient has a disease (Positive or Negative)
How It Works
The logistic regression model calculates a weighted sum z of the input features (plus an intercept term) and passes it through the sigmoid function:
sigmoid(z) = 1 / (1 + e^(-z))
This sigmoid output is interpreted as the probability that the input belongs to class 1. If the probability is greater than 0.5, we predict class 1; otherwise, class 0.
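The weighted sum, sigmoid, and 0.5 threshold described above can be sketched in plain Python. This is a minimal illustration, not how Spark implements it; the weights and bias below are hypothetical values picked by hand, not learned from data.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range 0 to 1."""
    # Branch on the sign of z so math.exp never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted sum plus the bias."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(features, weights, bias, threshold=0.5):
    """Class 1 if the probability exceeds the threshold, otherwise class 0."""
    return 1 if predict_proba(features, weights, bias) > threshold else 0

# Hypothetical parameters, for illustration only
weights = [1.5, 0.5, -1.0]
bias = -0.5

p = predict_proba([1.0, 2.0, 1.0], weights, bias)  # z = 1.0
print(round(p, 4), predict_class([1.0, 2.0, 1.0], weights, bias))
```

In a trained model the weights and bias are chosen by the fitting procedure; only the final two steps (sigmoid, then threshold) are fixed.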
Real-World Example: Predicting Customer Churn
Suppose we have a telecom company trying to predict whether a customer will leave (churn) based on usage data. This is a binary classification problem.
Question:
Why not use Linear Regression for this task?
Answer:
Because Linear Regression can predict values below 0 or above 1, while we need a probability between 0 and 1. Logistic Regression keeps the output in that range by passing it through the sigmoid function.
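To make the answer concrete, here is a small sketch that fits an ordinary least-squares line to 0/1 labels and shows its predictions leaving the 0 to 1 range. The data (monthly support calls versus a churn label) is hypothetical, and the sigmoid at the end is applied to the line's output only to show the squashing effect; it is not a fitted logistic regression model.

```python
import math

# Hypothetical toy data: support calls per month (x) and churn label (y)
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

slope, intercept = fit_line(xs, ys)

def linear_pred(x):
    return intercept + slope * x

# The straight line drops below 0 at x = 0 and exceeds 1 at x = 10
print(linear_pred(0.0), linear_pred(10.0))

# Passing any real value through the sigmoid keeps it between 0 and 1
print(sigmoid(linear_pred(0.0)), sigmoid(linear_pred(10.0)))
```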
Logistic Regression with PySpark MLlib
Apache Spark provides a high-level API for machine learning through MLlib. We’ll use the LogisticRegression class from pyspark.ml.classification.
Step-by-Step Implementation
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Create Spark session
spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()
# Sample training data
data = [
    (0.0, 1.0, 3.0, 0.0),
    (1.0, 2.0, 1.0, 1.0),
    (0.0, 2.0, 2.0, 0.0),
    (1.0, 3.0, 1.0, 1.0),
    (0.0, 1.0, 4.0, 0.0),
]
columns = ["feature1", "feature2", "feature3", "label"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Combine features into a single vector column
vec = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df = vec.transform(df)
# Initialize and train logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)
# Make predictions
predictions = model.transform(df)
predictions.select("features", "label", "prediction", "probability").show()
+-------------+-----+----------+---------------+
|     features|label|prediction|    probability|
+-------------+-----+----------+---------------+
|[0.0,1.0,3.0]|  0.0|       0.0|[0.7296,0.2704]|
|[1.0,2.0,1.0]|  1.0|       1.0|[0.4109,0.5891]|
|[0.0,2.0,2.0]|  0.0|       0.0|[0.6423,0.3577]|
|[1.0,3.0,1.0]|  1.0|       1.0|[0.3512,0.6488]|
|[0.0,1.0,4.0]|  0.0|       0.0|[0.7623,0.2377]|
+-------------+-----+----------+---------------+
Understanding the Output
- features: The input data after being combined into a single vector
- label: The actual class (0 or 1)
- prediction: The predicted class by the model
- probability: The model's estimated probability for each class, as a vector [P(class 0), P(class 1)]
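The link between the probability and prediction columns can be sketched in plain Python using the probability vectors from the table above. The `classify` helper is a hypothetical name introduced for this illustration; it mirrors the rule "predict class 1 when P(class 1) exceeds the threshold", which defaults to 0.5 and corresponds to the `threshold` parameter of LogisticRegression.

```python
# Probability vectors [P(class 0), P(class 1)] copied from the output table
probabilities = [
    [0.7296, 0.2704],
    [0.4109, 0.5891],
    [0.6423, 0.3577],
    [0.3512, 0.6488],
    [0.7623, 0.2377],
]

def classify(probability, threshold=0.5):
    """Hypothetical helper: class 1.0 if P(class 1) exceeds the threshold."""
    return 1.0 if probability[1] > threshold else 0.0

# With the default threshold this reproduces the prediction column
default_predictions = [classify(p) for p in probabilities]
print(default_predictions)

# A stricter threshold flips the borderline second row to class 0
strict_predictions = [classify(p, threshold=0.6) for p in probabilities]
print(strict_predictions)
```

Raising the threshold trades missed positives for fewer false alarms, which can matter when the two kinds of error have different costs.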
Evaluating the Model
evaluator = BinaryClassificationEvaluator(labelCol="label")
# The default metric is area under the ROC curve (areaUnderROC), not accuracy
auc = evaluator.evaluate(predictions)
print("Area under ROC:", auc)
Area under ROC: 0.71
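BinaryClassificationEvaluator reports area under the ROC curve by default, which is a different quantity from accuracy. The sketch below computes both by hand in plain Python on hypothetical scores and labels (not the output of the model above), so the difference is visible. The function names are illustrative, and the pairwise AUC formula is a simple O(n²) version suitable only for small examples.

```python
# Hypothetical P(class 1) scores and true labels, for illustration only
scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
labels = [1, 0, 1, 0, 1, 0]

def accuracy(scores, labels, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def area_under_roc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Equivalent to the area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

acc = accuracy(scores, labels)
auc_by_hand = area_under_roc(scores, labels)
print("accuracy:", acc)
print("area under ROC:", auc_by_hand)
```

Accuracy depends on the chosen threshold, while area under ROC depends only on how the scores rank positives against negatives. If accuracy is the metric you want in Spark, one option is MulticlassClassificationEvaluator with metricName="accuracy".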
Summary
Logistic Regression is a simple yet powerful algorithm for binary classification. In Spark, using MLlib with PySpark allows you to scale this to huge datasets across distributed systems.
For beginners, it is a great first algorithm for learning the basics of a machine learning workflow: feature preparation, training, prediction, and evaluation.