Logistic Regression is a statistical method used for binary classification problems. Despite the name "regression", it is actually used to predict categorical outcomes, such as yes/no, true/false, or spam/not spam.
It works by estimating probabilities using a logistic (sigmoid) function, which maps any real-valued number into a range between 0 and 1.
Logistic Regression is best used when your target variable is binary (i.e., it has only two possible outcomes).
The logistic regression model first computes a weighted sum of the input features, z = w1*x1 + w2*x2 + ... + wn*xn + b, and then passes z through a sigmoid function:
sigmoid(z) = 1 / (1 + e^(-z))
This sigmoid output is interpreted as the probability that the input belongs to class 1. If the probability is greater than 0.5, we predict class 1; otherwise, class 0.
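To make the decision rule concrete, here is a minimal plain-Python sketch (the weights and bias are made-up illustration values, not parameters learned by Spark):
import math

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters for three features
weights = [0.8, -0.5, 0.3]
bias = -0.2

def predict(features, threshold=0.5):
    # Weighted sum of the inputs plus the bias term
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)  # probability of class 1
    return (1 if p > threshold else 0), p

label, prob = predict([1.0, 2.0, 1.0])
print(label, round(prob, 4))  # z = -0.1 here, so this prints: 0 0.475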
Suppose we have a telecom company trying to predict whether a customer will leave (churn) based on usage data. This is a binary classification problem.
Why not use Linear Regression for this task?
Because Linear Regression can output any real value, including values below 0 or above 1, while we need a probability between 0 and 1. Logistic Regression naturally restricts the output to that range using the sigmoid function.
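As a quick plain-Python illustration (the z values below are arbitrary example scores, not output from any trained model), a raw linear score can be any real number, but the sigmoid squashes every score into the interval (0, 1):
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A linear model's raw score z is unbounded; the sigmoid of the
# same score always lies strictly between 0 and 1.
for z in (-4.0, 0.0, 2.5, 7.0):
    print(f"linear score {z:5.1f} -> sigmoid {sigmoid(z):.4f}")
# linear score  -4.0 -> sigmoid 0.0180
# linear score   0.0 -> sigmoid 0.5000
# linear score   2.5 -> sigmoid 0.9241
# linear score   7.0 -> sigmoid 0.9991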
Apache Spark provides a high-level API for machine learning through MLlib. We'll use LogisticRegression from pyspark.ml.classification.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Create Spark session
spark = SparkSession.builder.appName("LogisticRegressionExample").getOrCreate()
# Sample training data
data = [
(0.0, 1.0, 3.0, 0.0),
(1.0, 2.0, 1.0, 1.0),
(0.0, 2.0, 2.0, 0.0),
(1.0, 3.0, 1.0, 1.0),
(0.0, 1.0, 4.0, 0.0)
]
columns = ["feature1", "feature2", "feature3", "label"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Combine features into a single vector column
vec = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
df = vec.transform(df)
# Initialize and train logistic regression model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(df)
# Make predictions
predictions = model.transform(df)
predictions.select("features", "label", "prediction", "probability").show()
+-------------+-----+----------+--------------------+
|     features|label|prediction|         probability|
+-------------+-----+----------+--------------------+
|[0.0,1.0,3.0]|  0.0|       0.0|     [0.7296,0.2704]|
|[1.0,2.0,1.0]|  1.0|       1.0|     [0.4109,0.5891]|
|[0.0,2.0,2.0]|  0.0|       0.0|     [0.6423,0.3577]|
|[1.0,3.0,1.0]|  1.0|       1.0|     [0.3512,0.6488]|
|[0.0,1.0,4.0]|  0.0|       0.0|     [0.7623,0.2377]|
+-------------+-----+----------+--------------------+
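The probability column is a vector of [P(label=0), P(label=1)]. If you want the positive-class probability as a plain numeric column, one option (a sketch, requiring Spark 3.0+) is vector_to_array from pyspark.ml.functions:
from pyspark.ml.functions import vector_to_array

# Convert the probability vector to an array, then take element 1,
# which holds P(label = 1) for this binary model
with_p1 = predictions.withColumn("p_label1", vector_to_array("probability").getItem(1))
with_p1.select("features", "label", "p_label1").show()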
# Evaluate the model; note that BinaryClassificationEvaluator
# computes area under the ROC curve by default, not accuracy
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("Area under ROC:", auc)
Area under ROC: 0.71
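If you do want plain accuracy (the fraction of rows predicted correctly), one common approach, sketched here against the predictions DataFrame from above, is MulticlassClassificationEvaluator, which also handles the binary case:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Fraction of rows where prediction == label
acc_evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
print("Accuracy:", acc_evaluator.evaluate(predictions))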
Logistic Regression is a simple yet powerful algorithm for binary classification. In Spark, using MLlib with PySpark allows you to scale this to huge datasets across distributed systems.
For a beginner, it's a great first algorithm for learning the basic machine learning workflow: feature preparation, training, prediction, and evaluation. Those same steps can also be chained together, as sketched below.
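As a closing sketch, the steps above can be packaged into a single Pipeline. This assumes a hypothetical raw_df holding the feature1, feature2, feature3, and label columns, i.e. the DataFrame as it was before VectorAssembler was applied:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

# Chain feature assembly and model training into one estimator
assembler = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# raw_df is assumed to be the pre-assembly DataFrame from the example above
model = pipeline.fit(raw_df)
predictions = model.transform(raw_df)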