Apache Spark Course
Module 12: Project – Real-World Data Pipeline

Linear Regression in Spark MLlib

What is Linear Regression?

Linear Regression is a fundamental machine learning algorithm used to predict a continuous value based on the relationship between input variables (features) and an output variable (label).

It assumes that there is a linear relationship between the features and the label, which means the label can be calculated as a straight-line equation: y = mx + c.

Real-World Example: Predicting House Prices

Suppose we want to predict the price of a house based on its size (in square feet). Using Linear Regression, we aim to find a line that best fits the data points of house size vs. price.

Question:

Why use Linear Regression and not something else?

Answer:

Linear Regression is simple, interpretable, and often a good starting point. If the data shows a linear trend, this model works very well.

Steps to Implement Linear Regression in Spark

  1. Prepare the dataset
  2. Load the data into a Spark DataFrame
  3. Transform features using VectorAssembler
  4. Train the Linear Regression model
  5. Evaluate the model’s performance

Sample Dataset

We'll use a simple dataset of house size (in square feet) and price. As a CSV, it looks like this:

size,price
1000,300000
1500,400000
2000,500000
2500,600000
3000,700000

Python Code: Linear Regression with PySpark

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Step 1: Start Spark session
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Step 2: Load data
data = [
    (1000, 300000),
    (1500, 400000),
    (2000, 500000),
    (2500, 600000),
    (3000, 700000)
]
columns = ["size", "price"]
df = spark.createDataFrame(data, columns)

# Step 3: Feature Transformation
assembler = VectorAssembler(inputCols=["size"], outputCol="features")
assembled_data = assembler.transform(df)

# Step 4: Train Linear Regression Model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(assembled_data)

# Step 5: Model Summary
summary = model.summary
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
print("RMSE:", summary.rootMeanSquaredError)
print("R²:", summary.r2)

# Step 6: Make Predictions
predictions = model.transform(assembled_data)
predictions.select("features", "price", "prediction").show()

Output:

Coefficients: [200.0]
Intercept: 100000.0
RMSE: 0.0
R²: 1.0

+--------+------+----------+
|features| price|prediction|
+--------+------+----------+
|[1000.0]|300000|  300000.0|
|[1500.0]|400000|  400000.0|
|[2000.0]|500000|  500000.0|
|[2500.0]|600000|  600000.0|
|[3000.0]|700000|  700000.0|
+--------+------+----------+

Understanding the Output

  • Coefficients: The slope (m) in y = mx + c. Here, each additional square foot increases the predicted price by 200.
  • Intercept: The base value (c) when size is zero. Here, it's 100000.
  • RMSE: Root Mean Square Error – lower is better. In our small example it's 0 because the data points lie exactly on a straight line.
  • R²: R-squared – a value between 0 and 1 showing how well the model fits. 1.0 means perfect fit.
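You can sanity-check the model by plugging the learned slope and intercept back into y = mx + c by hand, using the values from the output above:

```python
m = 200.0       # coefficient (slope) reported by the model
c = 100000.0    # intercept reported by the model

# Predicted price for a 1500 sq ft house
price_1500 = m * 1500 + c
print(price_1500)  # 400000.0 — matches the prediction column
```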

Question:

What if the data has multiple features like size, location score, and age of house?

Answer:

We can list multiple columns in VectorAssembler, and Spark will learn a separate weight (coefficient) for each input feature. This is known as multiple linear regression.

Key Takeaways

  • Linear Regression is used to predict a number based on input variables.
  • Spark MLlib makes it scalable to apply Linear Regression on large datasets.
  • Understanding slope and intercept helps in interpreting the model output.
  • You can apply this to real-world problems like predicting salaries, car prices, or even movie revenues.