Apache Spark Course
Module 12: Project – Real-World Data Pipeline

Linear Regression in Spark MLlib

What is Linear Regression?

Linear Regression is a fundamental machine learning algorithm used to predict a continuous value based on the relationship between input variables (features) and an output variable (label).

It assumes that there is a linear relationship between the features and the label, which means the label can be calculated as a straight-line equation: y = mx + c.

Real-World Example: Predicting House Prices

Suppose we want to predict the price of a house based on its size (in square feet). Using Linear Regression, we aim to find a line that best fits the data points of house size vs. price.

Question:

Why use Linear Regression and not something else?

Answer:

Linear Regression is simple, interpretable, and often a good starting point. If the data shows a linear trend, this model works very well.

Steps to Implement Linear Regression in Spark

  1. Prepare the dataset
  2. Load the data into a Spark DataFrame
  3. Transform features using VectorAssembler
  4. Train the Linear Regression model
  5. Evaluate the model’s performance

Sample Dataset

We'll use a simple dataset of house size (in square feet) and price. As a CSV, it looks like this:

size,price
1000,300000
1500,400000
2000,500000
2500,600000
3000,700000

Python Code: Linear Regression with PySpark

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Step 1: Start Spark session
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Step 2: Load data
data = [
    (1000, 300000),
    (1500, 400000),
    (2000, 500000),
    (2500, 600000),
    (3000, 700000)
]
columns = ["size", "price"]
df = spark.createDataFrame(data, columns)

# Step 3: Feature Transformation
assembler = VectorAssembler(inputCols=["size"], outputCol="features")
assembled_data = assembler.transform(df)

# Step 4: Train Linear Regression Model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(assembled_data)

# Step 5: Model Summary
summary = model.summary
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
print("RMSE:", summary.rootMeanSquaredError)
print("R²:", summary.r2)

# Step 6: Make Predictions
predictions = model.transform(assembled_data)
predictions.select("features", "price", "prediction").show()

Output:

Coefficients: [200.0]
Intercept: 100000.0
RMSE: 0.0
R²: 1.0

+--------+------+----------+
|features| price|prediction|
+--------+------+----------+
|[1000.0]|300000|  300000.0|
|[1500.0]|400000|  400000.0|
|[2000.0]|500000|  500000.0|
|[2500.0]|600000|  600000.0|
|[3000.0]|700000|  700000.0|
+--------+------+----------+

Understanding the Output

  • Coefficients: The slope (m) in y = mx + c. Here, each additional square foot increases the predicted price by 200.
  • Intercept: The base value (c) when size is zero. Here, it's 100000.
  • RMSE: Root Mean Square Error – lower is better. In our small example it's 0 because the data points lie exactly on a straight line.
  • R²: R-squared – a value between 0 and 1 showing how well the model fits. 1.0 means perfect fit.
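You can sanity-check the model by plugging the learned slope and intercept back into y = mx + c by hand, using the values from the output above:

```python
m = 200.0       # coefficient (slope) reported by the model
c = 100000.0    # intercept reported by the model

# Predicted price for a 1500 sq ft house
price_1500 = m * 1500 + c
print(price_1500)  # 400000.0 — matches the prediction column
```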

Question:

What if the data has multiple features like size, location score, and age of house?

Answer:

We can list multiple columns in VectorAssembler, and Spark will learn a separate weight (coefficient) for each input feature. This is known as multiple linear regression.

Key Takeaways

  • Linear Regression is used to predict a number based on input variables.
  • Spark MLlib makes it scalable to apply Linear Regression on large datasets.
  • Understanding slope and intercept helps in interpreting the model output.
  • You can apply this to real-world problems like predicting salaries, car prices, or even movie revenues.