Decision Trees and Random Forests in Machine Learning
In this lesson, you'll learn about two of the most widely used and intuitive algorithms in Machine Learning: Decision Trees and Random Forests. These algorithms are commonly used for both classification and regression problems.
What is a Decision Tree?
A Decision Tree is a flowchart-like structure where each internal node represents a decision based on a feature (e.g., "Is Age > 30?"), each branch represents the outcome of the decision, and each leaf node represents a final output or class label.
It mimics human decision-making and is easy to visualize, making it a great tool for beginners to understand how models make predictions.
Example: Deciding Loan Approval
Imagine a bank wants to decide whether to approve a loan. Here are a few simple rules they might use:
- If Income > 50,000 → Approve
- Else, if Credit Score > 700 → Approve
- Else → Reject
This logic can be represented as a tree. The model "splits" the data at each node to separate it into categories as cleanly as possible.
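To make this concrete, here is a minimal sketch of that loan-approval logic written as plain Python if/else statements. The function name and thresholds simply mirror the illustrative rules above; they are not learned from data.
# A hand-written "decision tree" for the loan example above
def approve_loan(income, credit_score):
    if income > 50_000:          # first split: income
        return "Approve"
    elif credit_score > 700:     # second split: credit score
        return "Approve"
    else:
        return "Reject"
print(approve_loan(income=60_000, credit_score=650))  # Approve
print(approve_loan(income=40_000, credit_score=720))  # Approve
print(approve_loan(income=40_000, credit_score=650))  # Reject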
🧠 Question: How does a Decision Tree decide which feature to split on?
🔹 Answer: It uses metrics like Gini Impurity or Information Gain to determine which feature creates the purest subsets of data.
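As a quick illustration of the idea (not scikit-learn's internal code), here is a small sketch that computes Gini Impurity for a list of class labels; a tree evaluates something like this for every candidate split and keeps the split that yields the purest child nodes.
from collections import Counter
def gini_impurity(labels):
    """Gini = 1 - sum of squared class proportions in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())
print(gini_impurity([1, 1, 1, 1]))   # 0.0   -> perfectly pure node
print(gini_impurity([0, 0, 1, 1]))   # 0.5   -> maximally mixed (two classes)
print(gini_impurity([0, 1, 1, 1]))   # 0.375 -> mostly one class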
Python Code: Building a Decision Tree
Let’s use the popular scikit-learn library to train a Decision Tree classifier on the famous Titanic dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
# Select relevant features
df = df[['Survived', 'Pclass', 'Sex', 'Age']].copy()  # .copy() avoids SettingWithCopyWarning on the edits below
df.dropna(inplace=True)
# Convert categorical column 'Sex' to numeric
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# Split features and target
X = df[['Pclass', 'Sex', 'Age']]
y = df['Survived']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Decision Tree
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.75 # Output may vary slightly
Code Explanation
- df = pd.read_csv(...): Loads the Titanic dataset.
- df['Sex'].map(...): Converts categorical data into numbers (0 for male, 1 for female).
- DecisionTreeClassifier: scikit-learn's class for creating decision tree models.
- accuracy_score: Measures how many predictions were correct.
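Because Decision Trees are easy to visualize, you can also draw the fitted model. This optional sketch assumes the clf object trained above and that matplotlib is installed; it uses scikit-learn's plot_tree helper.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# Draw only the first few levels of the fitted tree to keep it readable
plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=['Pclass', 'Sex', 'Age'],
          class_names=['Died', 'Survived'], filled=True, max_depth=2)
plt.show()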
💠 Question: Why is it important to convert categorical variables into numeric?
🔹 Answer: Most machine learning models work only with numbers. Categorical text like 'male' or 'female' must be encoded numerically.
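The map used above works for a two-category column; for columns with more categories, one-hot encoding is a common alternative. A minimal, self-contained sketch (the tiny Embarked frame here is made up purely for illustration):
import pandas as pd
# One-hot encode a multi-category column
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
encoded = pd.get_dummies(toy, columns=['Embarked'])
print(encoded)
# Produces one 0/1 column per category: Embarked_C, Embarked_Q, Embarked_S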
What is a Random Forest?
A Random Forest is an ensemble method that builds multiple Decision Trees and merges them together to get a more accurate and stable prediction.
Each tree in the forest is built from a random sample of the training data (with replacement) and considers a random subset of features for splitting.
This technique reduces overfitting and increases accuracy.
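To see what "a random sample with replacement" looks like in practice, here is a small sketch of one bootstrap sample, reusing the X_train and y_train from the earlier code; a Random Forest effectively repeats this for every tree it grows.
import numpy as np
rng = np.random.default_rng(42)
# Draw a bootstrap sample: same size as the training set, sampled with replacement
indices = rng.integers(0, len(X_train), size=len(X_train))
X_boot = X_train.iloc[indices]
y_boot = y_train.iloc[indices]
# Some rows appear several times, others not at all (roughly 63% of rows on average)
print("Unique rows in this bootstrap sample:", len(np.unique(indices)))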
Example: Predicting Survival with Random Forest
from sklearn.ensemble import RandomForestClassifier
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predict
rf_pred = rf.predict(X_test)
# Evaluate
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
Output:
Random Forest Accuracy: 0.79 # Output may vary slightly
Code Explanation
- RandomForestClassifier: An ensemble of decision trees.
- n_estimators=100: Builds 100 decision trees.
- random_state=42: Ensures reproducibility.
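A handy by-product of a trained Random Forest is its per-feature importance scores. A short sketch using the rf model fitted above (feature_importances_ is a standard scikit-learn attribute):
import pandas as pd
importances = pd.Series(rf.feature_importances_, index=['Pclass', 'Sex', 'Age'])
print(importances.sort_values(ascending=False))
# Higher values mean the feature contributed more to the forest's splits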
💠 Question: Why does Random Forest often perform better than a single Decision Tree?
🔹 Answer: Because it averages the results of many diverse trees, reducing variance and overfitting.
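One rough way to see this effect is to repeat the train/test split a few times and compare how much each model's accuracy fluctuates. A sketch reusing X, y, and the classifiers imported earlier:
import numpy as np
tree_scores, forest_scores = [], []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    tree_scores.append(DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te))
    forest_scores.append(RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr).score(X_te, y_te))
# The forest's scores are typically a bit higher and less spread out
print("Tree  :", np.round(tree_scores, 3), "std:", round(np.std(tree_scores), 3))
print("Forest:", np.round(forest_scores, 3), "std:", round(np.std(forest_scores), 3))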
Key Differences Between Decision Tree and Random Forest
| Decision Tree | Random Forest |
|---|---|
| Single model, easy to visualize | Multiple trees, complex to visualize |
| High chance of overfitting | Reduces overfitting by averaging |
| Fast training | Slightly slower due to multiple trees |
💠 Question: Should we always use Random Forest over Decision Trees?
🔹 Answer: Not always. If you need interpretability and explainability, a Decision Tree is easier. But if you want better accuracy and robustness, Random Forest is usually better.
Conclusion
Decision Trees are a great starting point for understanding supervised learning, and Random Forests build on them to produce more accurate, robust models. Together, they are essential tools in any ML engineer's toolbox.
In the next module, we’ll explore Model Improvement Techniques like cross-validation and hyperparameter tuning.