Apache Spark Course

Module 12: Project – Real-World Data Pipeline

Linear Regression in Spark MLlib



What is Linear Regression?

Linear Regression is a fundamental machine learning algorithm used to predict a continuous value based on the relationship between input variables (features) and an output variable (label).

It assumes a linear relationship between the features and the label, meaning the label can be modeled with a straight-line equation: y = mx + c, where m is the slope (the coefficient) and c is the intercept.
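
Written out, with one feature the model is the familiar straight line; with several features each input gets its own coefficient (the β symbols below are standard notation, not Spark identifiers):

\hat{y} = m x + c \qquad \text{(one feature)}

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \qquad \text{(n features)}

Here \hat{y} is the predicted label, x_1, ..., x_n are the features, \beta_1, ..., \beta_n are the coefficients the model learns, and \beta_0 is the intercept.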

Real-World Example: Predicting House Prices

Suppose we want to predict the price of a house based on its size (in square feet). Using Linear Regression, we aim to find a line that best fits the data points of house size vs. price.

Question:

Why use Linear Regression and not something else?

Answer:

Linear Regression is simple, interpretable, and often a good starting point. If the data shows a linear trend, this model works very well.

Steps to Implement Linear Regression in Spark

  1. Prepare the dataset
  2. Load the data into a Spark DataFrame
  3. Transform features using VectorAssembler
  4. Train the Linear Regression model
  5. Evaluate the model’s performance
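
These steps can also be chained with Spark's Pipeline API. The sketch below is optional and assumes a DataFrame named df with "size" and "price" columns, like the one built in the full example further down:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Stage 1: turn the "size" column into a feature vector
assembler = VectorAssembler(inputCols=["size"], outputCol="features")

# Stage 2: the regression estimator itself
lr = LinearRegression(featuresCol="features", labelCol="price")

# A Pipeline runs both stages in order with a single fit() call
pipeline = Pipeline(stages=[assembler, lr])
pipeline_model = pipeline.fit(df)            # df: DataFrame with "size" and "price"
predictions = pipeline_model.transform(df)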

Sample Dataset

We'll use a simple CSV with house size and price. Here's a look at the data:

size,price
1000,300000
1500,400000
2000,500000
2500,600000
3000,700000
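
If this data were saved to a file (the path "house_prices.csv" below is just a placeholder), it could be read directly instead of building the DataFrame by hand as the next code block does. This sketch assumes the spark session created in Step 1 of the full example:

# Read the CSV shown above; header=True uses the first row as column names,
# inferSchema=True turns size and price into numeric columns
df = spark.read.csv("house_prices.csv", header=True, inferSchema=True)
df.printSchema()
df.show()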
    

Python Code: Linear Regression with PySpark


from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Step 1: Start Spark session
spark = SparkSession.builder.appName("LinearRegressionExample").getOrCreate()

# Step 2: Load data
data = [
    (1000, 300000),
    (1500, 400000),
    (2000, 500000),
    (2500, 600000),
    (3000, 700000)
]
columns = ["size", "price"]
df = spark.createDataFrame(data, columns)

# Step 3: Feature Transformation
assembler = VectorAssembler(inputCols=["size"], outputCol="features")
assembled_data = assembler.transform(df)

# Step 4: Train Linear Regression Model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(assembled_data)

# Step 5: Model Summary
summary = model.summary
print("Coefficients:", model.coefficients)
print("Intercept:", model.intercept)
print("RMSE:", summary.rootMeanSquaredError)
print("R²:", summary.r2)

# Step 6: Make Predictions
predictions = model.transform(assembled_data)
predictions.select("features", "price", "prediction").show()

Output:
Coefficients: [200.0]
Intercept: 100000.0
RMSE: 0.0
R²: 1.0

+--------+------+----------+
|features| price|prediction|
+--------+------+----------+
|[1000.0]|300000|  300000.0|
|[1500.0]|400000|  400000.0|
|[2000.0]|500000|  500000.0|
|[2500.0]|600000|  600000.0|
|[3000.0]|700000|  700000.0|
+--------+------+----------+
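
The trained model can also score house sizes it has never seen. A minimal sketch reusing the assembler and model objects from the code above (1750 and 2750 are made-up inputs, not part of the dataset):

# Build a small DataFrame of new house sizes, assemble features, and predict
new_sizes = spark.createDataFrame([(1750,), (2750,)], ["size"])
new_features = assembler.transform(new_sizes)
model.transform(new_features).select("size", "prediction").show()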
    

Understanding the Output

Question:

What if the data has multiple features like size, location score, and age of house?

Answer:

We can list multiple input columns in VectorAssembler, and Spark will still fit a Linear Regression model, learning one weight (coefficient) per input feature, as in the sketch below.
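
For illustration, suppose the DataFrame also had hypothetical location_score and age columns alongside size; only the VectorAssembler changes, and the rest of the training code stays the same:

# Combine several numeric columns into one feature vector
assembler = VectorAssembler(
    inputCols=["size", "location_score", "age"],   # hypothetical column names
    outputCol="features"
)
assembled_data = assembler.transform(df_multi)     # df_multi: a DataFrame containing these columns

lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(assembled_data)
print("Coefficients:", model.coefficients)         # one coefficient per input feature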

Key Takeaways

  - Linear Regression predicts a continuous label by fitting a straight-line (or, with many features, a hyperplane) relationship between the features and the label.
  - In Spark MLlib, input columns must first be combined into a single vector column with VectorAssembler before training.
  - The trained LinearRegression model exposes its coefficients, intercept, and summary metrics such as RMSE and R².
  - The same workflow extends to multiple features: list all input columns in VectorAssembler, and Spark learns one coefficient per feature.