House Price Prediction – Real-World ML Project
In this beginner-friendly project, we'll use machine learning to predict house prices from features like size, location, and number of rooms. It's one of the most popular real-world applications of ML!
Problem Statement
Given a dataset of houses with features like area, number of bedrooms, number of bathrooms, location, etc., predict the selling price of a house.
Why This Project?
- It’s a classic regression problem.
- Introduces data preprocessing, feature engineering, and model evaluation.
- Helps build real-world intuition.
Dataset
We will use the California Housing dataset provided by sklearn.datasets. It's built in, so there's nothing to download, and it's well suited for practice.
❯ Why not just use area to predict price?
🔸 Because price depends on multiple features – location, number of rooms, population density, etc.
🔹 A larger house in a poor neighborhood may be cheaper than a smaller one in a prime location.
Step-by-Step Implementation
1️⃣ Load and Explore the Dataset
from sklearn.datasets import fetch_california_housing
import pandas as pd
# Load data
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target
# Display first few rows
print(df.head())
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  Target
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23   4.526
1  8.3014      21.0  6.238137   0.971880       240.0  2.109842     37.86    -122.22   3.585
...
❯ What does the 'Target' column mean?
🔸 It represents the **median house value** of a district, in units of $100,000.
🔹 So a target of 4.526 means $452,600.
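To make the units concrete, here is a small optional exploration step (the variable name price_in_dollars is just an illustrative choice, not part of the dataset):
# The target is in units of $100,000, so multiply to get dollar prices
price_in_dollars = df['Target'] * 100000
print(price_in_dollars.head())
# Summary statistics for every column
print(df.describe())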
2️⃣ Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop('Target', axis=1)
y = df['Target']
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
❯ Why do we scale the features?
🔸 Many ML models, especially ones trained with gradients or regularization, learn faster and more reliably when features share a similar scale; for Linear Regression it also makes the learned coefficients comparable.
🔹 Without scaling, features with large numeric ranges (like Population) can dominate features with small ranges (like MedInc).
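You can see this directly by comparing feature statistics before and after scaling. This is an optional check, and scaled_df is just an illustrative name:
# Raw features: Population is in the hundreds, MedInc in single digits
import pandas as pd
print(X_train[['MedInc', 'Population']].describe().loc[['mean', 'std']])
# Scaled features: both now have mean ~0 and standard deviation ~1
scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns)
print(scaled_df[['MedInc', 'Population']].describe().loc[['mean', 'std']])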
3️⃣ Train the Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train_scaled, y_train)
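As an optional check (not required for the next step), you can inspect what the model learned; because the features were standardized, the coefficient magnitudes are roughly comparable:
# One learned weight per feature, plus the intercept
import pandas as pd
coefficients = pd.Series(model.coef_, index=X.columns).sort_values()
print(coefficients)
print("Intercept:", model.intercept_)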
4️⃣ Evaluate the Model
from sklearn.metrics import mean_squared_error, r2_score
predictions = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("Mean Squared Error:", mse)
print("R2 Score:", r2)
Mean Squared Error: 0.530
R2 Score: 0.61
❯ What is the R² Score?
🔸 It measures how much of the variability in the target your model explains.
🔹 R² = 1 is a perfect prediction; R² = 0 means the model is no better than always predicting the mean.
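To see where the number comes from, here is R² computed by hand from its definition; it should match r2_score up to floating-point rounding:
# R² = 1 - (residual sum of squares / total sum of squares)
import numpy as np
ss_res = np.sum((y_test - predictions) ** 2)      # squared errors of the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)    # squared errors of always predicting the mean
print("Manual R2:", 1 - ss_res / ss_tot)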
Bonus: Predict on New Data
import numpy as np
# One new house, with features in the same order as the training data:
# MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
sample = np.array([[8.0, 30.0, 6.0, 1.0, 300.0, 2.5, 37.85, -122.2]])
sample_scaled = scaler.transform(sample)  # apply the same scaling used for training
price = model.predict(sample_scaled)
print("Predicted Price:", round(price[0] * 100000, 2))
Predicted Price: 452378.45
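Depending on your scikit-learn version, calling scaler.transform on a plain NumPy array after fitting on a DataFrame may print a warning about missing feature names. A minimal way around it, assuming the sample uses the same feature order as X, is to wrap it in a DataFrame:
# Same prediction, with the sample carrying the original column names
import pandas as pd
sample_df = pd.DataFrame(sample, columns=X.columns)
price = model.predict(scaler.transform(sample_df))
print("Predicted Price:", round(price[0] * 100000, 2))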
Summary
- You learned how to load a real dataset using sklearn.
- You applied Linear Regression to predict house prices.
- You understood the use of scaling, evaluation metrics, and prediction.
Further Challenges
- Try using RandomForestRegressor and compare results (a starting sketch is shown below).
- Plot actual vs. predicted prices.
- Add polynomial features for improvement.
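A minimal sketch for the first challenge, comparing a RandomForestRegressor against the linear model. The hyperparameters here are reasonable defaults, not tuned values:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Tree ensembles don't need feature scaling, so the raw features are fine
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_predictions = rf.predict(X_test)
print("Random Forest MSE:", mean_squared_error(y_test, rf_predictions))
print("Random Forest R2:", r2_score(y_test, rf_predictions))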