House Price Prediction – Real-World ML Project
In this beginner-friendly project, we'll use machine learning to predict house prices from features like size, location, and number of rooms. It's one of the most popular real-world applications of ML!
Problem Statement
Given a dataset of houses with features like area, number of bedrooms, number of bathrooms, location, etc., predict the selling price of a house.
Why This Project?
- It’s a classic regression problem.
- Introduces data preprocessing, feature engineering, and model evaluation.
- Helps build real-world intuition.
Dataset
We will use the California Housing dataset provided by sklearn.datasets. It's built in, so there's nothing to download, and it's well suited for practice.
❯ Why not just use area to predict price?
🔸 Because price depends on multiple features – location, number of rooms, population density, etc.
🔹 A larger house in a poor neighborhood may be cheaper than a smaller one in a prime location.
Step-by-Step Implementation
1️⃣ Load and Explore the Dataset
from sklearn.datasets import fetch_california_housing
import pandas as pd
# Load data
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target
# Display first few rows
print(df.head())
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526
1  8.3014      21.0  6.238137   0.971880       240.0  2.109842     37.86    -122.22   3.585
...
❯ What does the 'Target' column mean?
🔸 It represents the **median house value** of a district, in units of $100,000.
🔹 So a target of 4.526 means $452,600.
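To make the units concrete, here is a small optional exploration step (the variable name price_in_dollars is just an illustrative choice, not part of the dataset):
# The target is in units of $100,000, so multiply to get dollar prices
price_in_dollars = df['Target'] * 100000
print(price_in_dollars.head())
# Summary statistics for every column
print(df.describe())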
2️⃣ Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop('Target', axis=1)
y = df['Target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
❯ Why do we scale the features?
🔸 Many ML models, especially ones trained with gradients or regularization, learn faster and more reliably when features share a similar scale; for Linear Regression it also makes the learned coefficients comparable.
🔹 Without scaling, features with large numeric ranges (like Population) can dominate features with small ranges (like MedInc).
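You can see this directly by comparing feature statistics before and after scaling. This is an optional check, and scaled_df is just an illustrative name:
# Raw features: Population is in the hundreds, MedInc in single digits
import pandas as pd
print(X_train[['MedInc', 'Population']].describe().loc[['mean', 'std']])
# Scaled features: both now have mean ~0 and standard deviation ~1
scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns)
print(scaled_df[['MedInc', 'Population']].describe().loc[['mean', 'std']])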
3️⃣ Train the Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
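As an optional check (not required for the next step), you can inspect what the model learned; because the features were standardized, the coefficient magnitudes are roughly comparable:
# One learned weight per feature, plus the intercept
import pandas as pd
coefficients = pd.Series(model.coef_, index=X.columns).sort_values()
print(coefficients)
print("Intercept:", model.intercept_)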
4️⃣ Evaluate the Model
from sklearn.metrics import mean_squared_error, r2_score
predictions = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)
Mean Squared Error: 0.530
R2 Score: 0.61
❯ What is the R² Score?
🔸 It measures how much of the variability in the target your model explains.
🔹 R² = 1 is a perfect prediction; R² = 0 means the model is no better than always predicting the mean.
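To see where the number comes from, here is R² computed by hand from its definition; it should match r2_score up to floating-point rounding:
# R² = 1 - (residual sum of squares / total sum of squares)
import numpy as np
ss_res = np.sum((y_test - predictions) ** 2)      # squared errors of the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)    # squared errors of always predicting the mean
print("Manual R2:", 1 - ss_res / ss_tot)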
Bonus: Predict on New Data
import numpy as np
# One new house, with features in the same order as the training data:
# MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
sample = np.array([[8.0, 30.0, 6.0, 1.0, 300.0, 2.5, 37.85, -122.2]])
sample_scaled = scaler.transform(sample)  # apply the same scaling used for training
price = model.predict(sample_scaled)
print("Predicted Price:", round(price[0] * 100000, 2))
Predicted Price: 452378.45
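Depending on your scikit-learn version, calling scaler.transform on a plain NumPy array after fitting on a DataFrame may print a warning about missing feature names. A minimal way around it, assuming the sample uses the same feature order as X, is to wrap it in a DataFrame:
# Same prediction, with the sample carrying the original column names
import pandas as pd
sample_df = pd.DataFrame(sample, columns=X.columns)
price = model.predict(scaler.transform(sample_df))
print("Predicted Price:", round(price[0] * 100000, 2))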
Summary
- You learned how to load a real dataset using sklearn.
- You applied Linear Regression to predict house prices.
- You understood the use of scaling, evaluation metrics, and prediction.
Further Challenges
- Try using RandomForestRegressor and compare results (a starting sketch is shown below).
- Plot actual vs. predicted prices.
- Add polynomial features for improvement.
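A minimal sketch for the first challenge, comparing a RandomForestRegressor against the linear model. The hyperparameters here are reasonable defaults, not tuned values:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Tree ensembles don't need feature scaling, so the raw features are fine
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
print("Random Forest MSE:", mean_squared_error(y_test, rf_predictions))
print("Random Forest R2:", r2_score(y_test, rf_predictions))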