Linear Regression is a fundamental machine learning algorithm used to predict a continuous value based on the relationship between input variables (features) and an output variable (label).
It assumes a linear relationship between the features and the label, which means the label can be calculated with a straight-line equation: y = mx + c, where m is the slope (the coefficient) and c is the intercept.
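To make the line equation concrete, here is a minimal plain-Python sketch using hypothetical values m = 200 and c = 100000 (in a real model these values are learned from the data):
# Straight-line prediction: y = m*x + c
# m and c below are hypothetical, chosen only for illustration.
m = 200        # price increase per additional square foot
c = 100000     # base price when size is 0

def predict_price(size_sqft):
    # y = m*x + c
    return m * size_sqft + c

print(predict_price(1500))  # 200 * 1500 + 100000 = 400000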
Suppose we want to predict the price of a house based on its size (in square feet). Using Linear Regression, we aim to find a line that best fits the data points of house size vs. price.
Why use Linear Regression and not something else?
Linear Regression is simple, interpretable, and often a good starting point. If the data shows a linear trend, this model works very well.
Spark MLlib expects all input features to be packed into a single vector column. We do this with VectorAssembler, a feature transformer that combines one or more columns into a single "features" vector.
We'll use a simple CSV with house size and price. Here's a look at the data:
size,price
1000,300000
1500,400000
2000,500000
2500,600000
3000,700000
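Since the data is described as a CSV, here is a hedged sketch of reading it directly from a file named houses.csv (a hypothetical path), assuming a header row; the walkthrough below builds the same DataFrame inline instead:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# header=True uses the first row as column names,
# inferSchema=True parses size and price as numeric types.
df = spark.read.csv("houses.csv", header=True, inferSchema=True)
df.printSchema()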
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# Step 1: Start Spark session
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()
# Step 2: Load data
data = [
(1000, 300000),
(1500, 400000),
(2000, 500000),
(2500, 600000),
(3000, 700000)
]
columns = ["size", "price"]
df = spark.createDataFrame(data, columns)
# Step 3: Feature Transformation
assembler = VectorAssembler(inputCols=["size"], outputCol="features")
assembled_data = assembler.transform(df)
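# At this point, assembled_data has a new 'features' column where each
# size value is wrapped in a one-element vector, e.g. 1000 -> [1000.0].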
# Step 4: Train Linear Regression Model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(assembled_data)
# Step 5: Model Summary
summary = model.summary
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
print("RMSE:", summary.rootMeanSquaredError)
print("R²:", summary.r2)
# Step 6: Make Predictions
predictions = model.transform(assembled_data)
predictions.select("features", "price", "prediction").show()
Coefficients: [200.0]
Intercept: 100000.0
RMSE: 0.0
R²: 1.0
+--------+------+----------+
|features| price|prediction|
+--------+------+----------+
|[1000.0]|300000|  300000.0|
|[1500.0]|400000|  400000.0|
|[2000.0]|500000|  500000.0|
|[2500.0]|600000|  600000.0|
|[3000.0]|700000|  700000.0|
+--------+------+----------+
The model learned the line y = mx + c with slope m = 200 (the coefficient) and intercept c = 100000, i.e. price = 200 × size + 100000. It means that for every additional square foot, the predicted price increases by 200. The RMSE of 0.0 and R² of 1.0 indicate a perfect fit, which happens here only because the toy data lies exactly on a straight line.
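As a quick check, the trained model can also score house sizes it has never seen. This is a minimal sketch, assuming the spark session, assembler, and model from the example above are still in scope; the sizes 1200 and 1750 are hypothetical:
# Score new, unseen house sizes with the trained model
new_df = spark.createDataFrame([(1200,), (1750,)], ["size"])
new_features = assembler.transform(new_df)
model.transform(new_features).select("size", "prediction").show()
# With price = 200 * size + 100000, this prints 340000.0 and 450000.0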
What if the data has multiple features, such as size, location score, and age of the house?
We can pass multiple columns to VectorAssembler, and Spark will still apply Linear Regression, fitting a separate weight (coefficient) for each input feature, as shown in the sketch below.
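Here is a hedged sketch of that multi-feature setup. The column names location_score and age are hypothetical, and the tiny dataset is made up purely for illustration:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("MultiFeatureLinearRegression").getOrCreate()

# Hypothetical data: size (sq ft), location score (1-10), age (years), price
data = [
    (1000, 7, 10, 310000),
    (1500, 8, 5, 420000),
    (2000, 6, 15, 480000),
    (2500, 9, 2, 650000),
    (3000, 7, 8, 690000)
]
df = spark.createDataFrame(data, ["size", "location_score", "age", "price"])

# Combine all three input columns into one feature vector
assembler = VectorAssembler(
    inputCols=["size", "location_score", "age"],
    outputCol="features"
)
assembled = assembler.transform(df)

# Same estimator as before; it now learns one coefficient per feature
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(assembled)

print("Coefficients:", model.coefficients)  # one weight per input column
print("Intercept:", model.intercept)
Each entry in model.coefficients corresponds, in order, to one of the columns listed in inputCols, so the individual effect of size, location score, and age can still be read off directly.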