Machine Learning for Beginners

Machine Learning - Linear Regression for Beginners

What is Linear Regression?

Linear Regression is one of the simplest and most widely used algorithms in machine learning. It is a supervised learning technique for predicting a continuous value (such as a price, a salary, or a temperature).

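In simple terms, Linear Regression finds the straight line that best fits the data points, capturing the relationship between the input features (X) and the target variable (y). "Best" here usually means the line that minimizes the sum of squared differences between the predicted and the actual values (ordinary least squares).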

Real-life Example:

Suppose you're a data scientist at a real estate company. You want to predict the price of a house based on its size (in square feet). By using previous house sale data (size and price), you can train a linear regression model to predict future prices.


Understanding the Formula

The equation for simple linear regression is:

y = mx + b
  • y: Target variable (e.g., house price)
  • x: Input feature (e.g., house size)
  • m: Slope of the line (how much y changes when x increases by one unit)
  • b: Intercept (value of y when x = 0)

Question:

What does the slope tell us in real life?

Answer: It tells us how much the house price increases (or decreases) for each additional square foot in size.
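
To make the slope concrete, here is a minimal sketch of the formula as a Python function. The function name line is hypothetical, and the values m = 100 and b = 100000 are made-up numbers chosen only for illustration.

```python
def line(x, m, b):
    """Return y = m * x + b for a single input x."""
    return m * x + b

# Hypothetical values, chosen only for illustration
m = 100       # assumed slope: dollars per additional square foot
b = 100000    # assumed intercept: value of y when x = 0

price_1000 = line(1000, m, b)
price_1001 = line(1001, m, b)

# Adding one square foot changes the result by exactly the slope
print(price_1001 - price_1000)  # prints 100
```

Whatever values m and b take, increasing x by one unit always changes y by m, which is why the slope reads as "price change per extra square foot".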


Step-by-step Example

Let’s take a small dataset of house sizes and their prices:


Size (sqft): [1000, 1500, 2000, 2500, 3000]
Price ($):   [200000, 250000, 300000, 350000, 400000]

We want to build a model that can predict the price of a new house, say 2200 sqft.
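
Before using a library, it can help to see that the best-fit line for a single feature can be computed directly. The sketch below applies the standard ordinary least squares formulas with NumPy to the dataset above. The variable names are our own, and this is an illustration of the idea rather than a description of how any particular library computes the fit internally.

```python
import numpy as np

# Data from the table above
size = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
price = np.array([200000, 250000, 300000, 350000, 400000], dtype=float)

# Ordinary least squares for one feature:
#   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - m * mean_x
x_mean, y_mean = size.mean(), price.mean()
m = np.sum((size - x_mean) * (price - y_mean)) / np.sum((size - x_mean) ** 2)
b = y_mean - m * x_mean

# Estimate the price of a 2200 sqft house with y = m * x + b
estimate = m * 2200 + b
print(f"slope={m:.2f}, intercept={b:.2f}, estimate for 2200 sqft={estimate:.2f}")
```

For this dataset the slope works out to 100 and the intercept to 100000, so the estimate for 2200 sqft is 320000.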

Python Code

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Step 1: Prepare the data
X = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1)  # Feature (2D)
y = np.array([200000, 250000, 300000, 350000, 400000])      # Target (1D)

# Step 2: Create and train the model
model = LinearRegression()
model.fit(X, y)

# Step 3: Make a prediction
predicted_price = model.predict([[2200]])
print(f"Predicted price for 2200 sqft: ${predicted_price[0]:.2f}")

# Step 4: Plot the data and prediction line
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, model.predict(X), color='red', label='Prediction Line')
plt.xlabel("Size (sqft)")
plt.ylabel("Price ($)")
plt.title("Linear Regression: House Price Prediction")
plt.legend()
plt.grid(True)
plt.show()

Output:

Predicted price for 2200 sqft: $320000.00

Code Explanation:

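  • Step 1: We create our input (X) and target (y) arrays. reshape(-1, 1) turns the 1D array into a 2D array with a single column, because scikit-learn expects X to have the shape (n_samples, n_features).
  • Step 2: We create an instance of LinearRegression() and fit it to our data.
  • Step 3: We use predict() to forecast the price for 2200 sqft. The input is written as [[2200]] (one row, one feature), and predict() returns an array, so we read its first element.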
  • Step 4: We plot the original data and the best-fit line learned by the model.
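
To connect the trained model back to y = mx + b, the sketch below reads the learned slope and intercept from the fitted model through its coef_ and intercept_ attributes. It refits the same data so that it runs on its own; the variable name manual is our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1)
y = np.array([200000, 250000, 300000, 350000, 400000])

model = LinearRegression().fit(X, y)

m = model.coef_[0]      # slope: coef_ holds one coefficient per feature
b = model.intercept_    # intercept

print(f"m = {m:.2f}, b = {b:.2f}")

# A prediction is just m * x + b
manual = m * 2200 + b
print(np.isclose(manual, model.predict([[2200]])[0]))
```

The slope should come out at about 100 (dollars per square foot) and the intercept at about 100000, and the manual calculation should agree with predict().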

Another Example: Predicting Student Scores

Suppose you want to predict the final exam score based on the number of study hours:


Hours Studied: [1, 2, 3, 4, 5]
Scores:        [20, 40, 60, 80, 100]

Python Code:

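# New example: Study hours vs Scores
# Imports repeated so this snippet also runs on its own
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # Feature (2D)
y = np.array([20, 40, 60, 80, 100])           # Target (1D)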

model = LinearRegression()
model.fit(X, y)

predicted_score = model.predict([[3.5]])
print(f"Predicted score for 3.5 hours study: {predicted_score[0]:.2f}")

Output:

Predicted score for 3.5 hours study: 70.00

Question:

What would happen if a student studies for 0 hours?

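Answer: The model predicts the intercept, because y = m × 0 + b = b. In this dataset the scores follow y = 20x exactly, so the intercept, and therefore the predicted score, is 0 (up to floating-point rounding).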
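
This can be checked with a short sketch that refits the study-hours data and asks for a prediction at 0 hours. The variable names hours, scores, and zero_hours are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
scores = np.array([20, 40, 60, 80, 100])

model = LinearRegression().fit(hours, scores)

# At x = 0 the prediction equals the intercept
zero_hours = model.predict([[0]])[0]
print(f"Intercept: {model.intercept_:.2f}")
print(f"Predicted score for 0 hours: {zero_hours:.2f}")
```

Both values should be zero up to floating-point rounding. Note that 0 hours lies outside the range of the training data (1 to 5 hours), so this is an extrapolation and deserves more caution than a prediction inside that range.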


When to Use Linear Regression?

  • When the target is a continuous number
  • When there’s a roughly linear relationship between input and output
  • When you want interpretability (you can explain slope and intercept easily)
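
One rough way to check the "roughly linear" condition is to fit the model and look at its R² value, which score() returns for a fitted LinearRegression. The sketch below uses two made-up targets, one exactly linear and one curved; the data and variable names are hypothetical. R² is only one indicator, and plotting the data is usually a good first check as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(-5, 6).reshape(-1, 1)   # hypothetical inputs -5 .. 5

y_linear = 3 * X.ravel() + 2          # made-up, exactly linear target
y_curved = X.ravel() ** 2             # made-up, U-shaped target

# score() returns R^2: 1.0 is a perfect fit, values near 0 mean the
# line explains almost none of the variation in y
r2_linear = LinearRegression().fit(X, y_linear).score(X, y_linear)
r2_curved = LinearRegression().fit(X, y_curved).score(X, y_curved)

print(f"R^2 on linear data: {r2_linear:.3f}")
print(f"R^2 on curved data: {r2_curved:.3f}")
```

The linear target should score essentially 1, while the U-shaped target should score essentially 0, because no straight line can follow that curve.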

Limitations

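  • Fits poorly when the relationship between input and output is non-linear (a straight line cannot follow a curve)
  • Sensitive to outliers, because squaring the errors lets a few extreme points pull the line toward them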
  • Can underperform if important features are missing
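
The sensitivity to outliers can be seen in a small sketch. It reuses the house data and changes a single price to a made-up extreme value (the 900000 figure is hypothetical), then compares the fitted slopes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1000, 1500, 2000, 2500, 3000]).reshape(-1, 1)
y_clean = np.array([200000, 250000, 300000, 350000, 400000])

# Hypothetical outlier: the 3000 sqft house is recorded at $900,000
y_outlier = y_clean.copy()
y_outlier[-1] = 900000

slope_clean = LinearRegression().fit(X, y_clean).coef_[0]
slope_outlier = LinearRegression().fit(X, y_outlier).coef_[0]

print(f"Slope without outlier: {slope_clean:.2f}")
print(f"Slope with outlier:    {slope_outlier:.2f}")
```

With this one changed point, the slope goes from about 100 to about 300 dollars per square foot, so a single extreme value triples the model's estimate of how much each extra square foot is worth.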

Summary

  • Linear Regression fits a straight line that relates the input to the output
  • Scikit-learn makes it easy to implement
  • Great for getting started with predictive models

In the next module, we’ll explore Logistic Regression, which is used for classification problems.