What is Linear Regression?
Linear Regression is a fundamental machine learning algorithm used to predict a continuous value based on the relationship between input variables (features) and an output variable (label).
It assumes that there is a linear relationship between the features and the label, which means the label can be calculated as a straight-line equation: y = mx + c.
Real-World Example: Predicting House Prices
Suppose we want to predict the price of a house based on its size (in square feet). Using Linear Regression, we aim to find a line that best fits the data points of house size vs. price.
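The "best-fit line" has a closed-form solution via ordinary least squares. A minimal pure-Python sketch of that idea, using the same sample sizes and prices as the dataset below:

```python
# Ordinary least squares for one feature: price = m * size + c
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [300000, 400000, 500000, 600000, 700000]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Slope m = covariance(size, price) / variance(size)
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
    / sum((x - mean_x) ** 2 for x in sizes)
# Intercept c makes the line pass through the mean point
c = mean_y - m * mean_x

print(m, c)  # 200.0 100000.0
```

This is the same fit Spark computes at scale; the hand-rolled version just makes the slope/intercept arithmetic visible.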
Question:
Why use Linear Regression and not something else?
Answer:
Linear Regression is simple, interpretable, and often a good starting point. If the data shows a linear trend, this model works very well.
Steps to Implement Linear Regression in Spark
- Prepare the dataset
- Load the data into a Spark DataFrame
- Transform features using VectorAssembler
- Train the Linear Regression model
- Evaluate the model’s performance
Sample Dataset
We'll use a simple CSV with house size and price. Here's a look at the data:
size,price
1000,300000
1500,400000
2000,500000
2500,600000
3000,700000
Python Code: Linear Regression with PySpark
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
# Step 1: Start Spark session
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()
# Step 2: Load data
data = [
(1000, 300000),
(1500, 400000),
(2000, 500000),
(2500, 600000),
(3000, 700000)
]
columns = ["size", "price"]
df = spark.createDataFrame(data, columns)
# Step 3: Feature Transformation
assembler = VectorAssembler(inputCols=["size"], outputCol="features")
assembled_data = assembler.transform(df)
# Step 4: Train Linear Regression Model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(assembled_data)
# Step 5: Model Summary
summary = model.summary
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
print("RMSE:", summary.rootMeanSquaredError)
print("R²:", summary.r2)
# Step 6: Make Predictions
predictions = model.transform(assembled_data)
predictions.select("features", "price", "prediction").show()
Coefficients: [200.0]
Intercept: 100000.0
RMSE: 0.0
R²: 1.0
+--------+------+----------+
|features| price|prediction|
+--------+------+----------+
|[1000.0]|300000|  300000.0|
|[1500.0]|400000|  400000.0|
|[2000.0]|500000|  500000.0|
|[2500.0]|600000|  600000.0|
|[3000.0]|700000|  700000.0|
+--------+------+----------+
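Once the slope and intercept are known, predicting an unseen house size is just the line equation. A quick sanity check in plain Python (the 1,800 sq ft house is a made-up example, not part of the dataset):

```python
# Fitted parameters reported by the Spark run above
slope, intercept = 200.0, 100000.0

def predict_price(size_sqft):
    """Apply y = m*x + c with the fitted coefficients."""
    return slope * size_sqft + intercept

print(predict_price(1800))  # 460000.0
```

In Spark itself you would instead build a new DataFrame, pass it through the same VectorAssembler, and call model.transform on it.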
Understanding the Output
- Coefficients: The slope (m) in y = mx + c. It means for every additional square foot, price increases by 200.
- Intercept: The base value (c) when size is zero. Here, it's 100000.
- RMSE: Root Mean Squared Error – lower is better. In our small example it's 0 because every data point lies exactly on the fitted line.
- R²: R-squared – a value between 0 and 1 showing how well the model fits. 1.0 means a perfect fit.
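Both metrics can be verified by hand from the actual and predicted prices in the output above:

```python
import math

actual = [300000, 400000, 500000, 600000, 700000]
predicted = [300000.0, 400000.0, 500000.0, 600000.0, 700000.0]

# RMSE: square root of the mean squared residual
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# R²: 1 - (residual sum of squares / total sum of squares)
mean_y = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_y) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)  # 0.0 1.0
```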
Question:
What if the data has multiple features like size, location score, and age of house?
Answer:
We can include multiple columns in VectorAssembler, and Spark will still apply Linear Regression, adjusting a weight (coefficient) for each input feature.
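Conceptually, with several features the solver finds one weight per column plus an intercept. A sketch with NumPy's least-squares solver – the location-score and age columns and their values are invented for illustration; in Spark you would simply list all three columns in VectorAssembler's inputCols:

```python
import numpy as np

# Hypothetical dataset: size (sq ft), location score, age (years) -> price
X = np.array([
    [1000, 7, 20],
    [1500, 8, 15],
    [2000, 6, 10],
    [2500, 9,  5],
    [3000, 7,  2],
], dtype=float)
y = np.array([300000, 400000, 500000, 600000, 700000], dtype=float)

# Append a column of ones so the solver also fits an intercept
X_aug = np.column_stack([X, np.ones(len(X))])
weights, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(weights)  # one coefficient per feature, intercept last
```

Because these prices depend only on size, the solver assigns (near-)zero weight to the extra columns; with real data each feature would earn a weight reflecting its contribution.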
Key Takeaways
- Linear Regression is used to predict a number based on input variables.
- Spark MLlib makes it scalable to apply Linear Regression on large datasets.
- Understanding slope and intercept helps in interpreting the model output.
- You can apply this to real-world problems like predicting salaries, car prices, or even movie revenues.