What is Linear Regression?
Linear Regression is one of the simplest and most widely used algorithms in machine learning. It is a supervised learning technique used for predicting a continuous value (like price, salary, temperature).
In simple terms, Linear Regression tries to draw a straight line through the data points that best represents the relationship between the input features (X) and the target variable (y).
Real-life Example:
Suppose you're a data scientist at a real estate company. You want to predict the price of a house based on its size (in square feet). By using previous house sale data (size and price), you can train a linear regression model to predict future prices.
Understanding the Formula
The equation for simple linear regression is:
y = mx + b
- y: Target variable (e.g., house price)
- x: Input feature (e.g., house size)
- m: Slope of the line (how much y changes with x)
- b: Intercept (value of y when x = 0)
🧠 Question:
What does the slope tell us in real life?
Answer: It tells us how much the house price increases (or decreases) for each additional square foot in size.
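To make the slope concrete, here is a minimal sketch that computes m and b directly with NumPy's np.polyfit (a degree-1 polynomial fit is exactly the least-squares line y = mx + b), using the same house-size and price numbers as the dataset in the next section:

```python
import numpy as np

# House sizes (sqft) and prices ($)
sizes = np.array([1000, 1500, 2000, 2500, 3000])
prices = np.array([200000, 250000, 300000, 350000, 400000])

# Degree-1 least-squares fit: returns slope m and intercept b
m, b = np.polyfit(sizes, prices, 1)

print(f"Slope m: {m:.2f}")       # price increase per additional sqft
print(f"Intercept b: {b:.2f}")   # predicted price at size 0
```

Because this data is perfectly linear, the fit recovers m = 100 and b = 100000: each extra square foot adds $100 to the predicted price.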
Step-by-step Example
Let’s take a small dataset of house sizes and their prices:
Size (sqft): [1000, 1500, 2000, 2500, 3000]
Price ($): [200000, 250000, 300000, 350000, 400000]
We want to build a model that can predict the price of a new house, say 2200 sqft.
Python Code
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Step 1: Prepare the data
X = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1) # Feature (2D)
y = np.array([200000, 250000, 300000, 350000, 400000]) # Target (1D)
# Step 2: Create and train the model
model = LinearRegression()
model.fit(X, y)
# Step 3: Make a prediction
predicted_price = model.predict([[2200]])
print(f"Predicted price for 2200 sqft: ${predicted_price[0]:.2f}")
# Step 4: Plot the data and prediction line
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', label='Prediction Line')
plt.xlabel("Size (sqft)")
plt.ylabel("Price ($)")
plt.title("Linear Regression: House Price Prediction")
plt.legend()
plt.grid(True)
plt.show()
Output:
Predicted price for 2200 sqft: $320000.00
Code Explanation:
- Step 1: We create our input (X) and target (y) arrays. The reshape(-1, 1) call converts the 1D array into a 2D column vector, as scikit-learn requires.
- Step 2: We create an instance of LinearRegression() and fit it to our data.
- Step 3: We use predict() to forecast the price for 2200 sqft.
- Step 4: We plot the original data and the best-fit line learned by the model.
Another Example: Predicting Student Scores
Suppose you want to predict the final exam score based on the number of study hours:
Hours Studied: [1, 2, 3, 4, 5]
Scores: [20, 40, 60, 80, 100]
Python Code:
# New example: Study hours vs Scores
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([20, 40, 60, 80, 100])
model = LinearRegression()
model.fit(X, y)
predicted_score = model.predict([[3.5]])
print(f"Predicted score for 3.5 hours study: {predicted_score[0]:.2f}")
Output:
Predicted score for 3.5 hours study: 70.00
🧠 Question:
What would happen if a student studies for 0 hours?
Answer: The model will still predict a value: the intercept b. With this data the fitted line is score = 20 × hours, so the intercept is 0 and the predicted score for 0 hours of study is 0.
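This is easy to check directly: a prediction at x = 0 is just the intercept. A small sketch, refitting the study-hours data from above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([20, 40, 60, 80, 100])

model = LinearRegression().fit(X, y)

# At x = 0, y = m*0 + b = b, so the prediction equals the intercept
print(f"Intercept: {model.intercept_:.2f}")
print(f"Predicted score for 0 hours: {model.predict([[0]])[0]:.2f}")
```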
When to Use Linear Regression?
- When the target is a continuous number
- When there’s a roughly linear relationship between input and output
- When you want interpretability (you can explain slope and intercept easily)
⚠️ Limitations
- Doesn’t work well with non-linear data
- Very sensitive to outliers
- Can underperform if important features are missing
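The outlier sensitivity is easy to demonstrate: because least squares minimizes squared errors, a single bad point can pull the whole line toward it. A minimal sketch with made-up data, comparing the slope fitted with and without one corrupted value:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y_clean = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # perfectly linear: y = 10x
y_outlier = y_clean.copy()
y_outlier[-1] = 500.0                                # one corrupted point

slope_clean = LinearRegression().fit(X, y_clean).coef_[0]
slope_outlier = LinearRegression().fit(X, y_outlier).coef_[0]

print(f"Slope without outlier: {slope_clean:.2f}")   # 10.00
print(f"Slope with outlier:    {slope_outlier:.2f}") # 100.00
```

One corrupted point out of five changed the slope tenfold. This is why it is worth inspecting a scatter plot (as in the code above) before trusting a linear fit.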
Summary
- Linear Regression fits a line between input and output
- Scikit-learn makes it easy to implement
- Great for getting started with predictive models
In the next module, we’ll explore Logistic Regression, which is used for classification problems.