Decision Trees and Random Forests in Machine Learning
In this lesson, you'll learn about two of the most widely used and intuitive algorithms in Machine Learning: Decision Trees and Random Forests. These algorithms are commonly used for both classification and regression problems.
What is a Decision Tree?
A Decision Tree is a flowchart-like structure where each internal node represents a decision based on a feature (e.g., "Is Age > 30?"), each branch represents the outcome of the decision, and each leaf node represents a final output or class label.
It mimics human decision-making and is easy to visualize, making it a great tool for beginners to understand how models make predictions.
Example: Deciding Loan Approval
Imagine a bank wants to decide whether to approve a loan. Here are a few simple rules they might use:
- If Income > 50,000 → Approve
- Else, if Credit Score > 700 → Approve
- Else → Reject
This logic can be represented as a tree. The model "splits" the data at each node to separate it into categories as cleanly as possible.
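To make this concrete, here is a minimal sketch of that loan-approval logic written as plain Python if/else statements. The function name and thresholds simply mirror the illustrative rules above; they are not learned from data.
# A hand-written "decision tree" for the loan example above
def approve_loan(income, credit_score):
    if income > 50_000:          # first split: income
        return "Approve"
    elif credit_score > 700:     # second split: credit score
        return "Approve"
    else:
        return "Reject"
print(approve_loan(income=60_000, credit_score=650))  # Approve
print(approve_loan(income=40_000, credit_score=720))  # Approve
print(approve_loan(income=40_000, credit_score=650))  # Reject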
🧠 Question: How does a Decision Tree decide which feature to split on?
🔹 Answer: It uses metrics like Gini Impurity or Information Gain to determine which feature creates the purest subsets of data.
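As a quick illustration of the idea (not scikit-learn's internal code), here is a small sketch that computes Gini Impurity for a list of class labels; a tree evaluates something like this for every candidate split and keeps the split that yields the purest child nodes.
from collections import Counter
def gini_impurity(labels):
    """Gini = 1 - sum of squared class proportions in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in counts.values())
print(gini_impurity([1, 1, 1, 1]))   # 0.0   -> perfectly pure node
print(gini_impurity([0, 0, 1, 1]))   # 0.5   -> maximally mixed (two classes)
print(gini_impurity([0, 1, 1, 1]))   # 0.375 -> mostly one class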
Python Code: Building a Decision Tree
Let’s use the popular scikit-learn library to train a Decision Tree classifier on the famous Titanic dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)
# Select relevant features
df = df[['Survived', 'Pclass', 'Sex', 'Age']].copy()  # .copy() avoids SettingWithCopyWarning on the edits below
df.dropna(inplace=True)
# Convert categorical column 'Sex' to numeric
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# Split features and target
X = df[['Pclass', 'Sex', 'Age']]
y = df['Survived']
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train Decision Tree
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.75 # Output may vary slightly
Code Explanation
- df = pd.read_csv(...): Loads the Titanic dataset.
- df['Sex'].map(...): Converts categorical data into numbers (0 for male, 1 for female).
- DecisionTreeClassifier: scikit-learn's class for creating decision tree models.
- accuracy_score: Measures how many predictions were correct.
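Because Decision Trees are easy to visualize, you can also draw the fitted model. This optional sketch assumes the clf object trained above and that matplotlib is installed; it uses scikit-learn's plot_tree helper.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
# Draw only the first few levels of the fitted tree to keep it readable
plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=['Pclass', 'Sex', 'Age'],
          class_names=['Died', 'Survived'], filled=True, max_depth=2)
plt.show()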
💠 Question: Why is it important to convert categorical variables into numeric?
🔹 Answer: Most machine learning models work only with numbers. Categorical text like 'male' or 'female' must be encoded numerically.
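The map used above works for a two-category column; for columns with more categories, one-hot encoding is a common alternative. A minimal, self-contained sketch (the tiny Embarked frame here is made up purely for illustration):
import pandas as pd
# One-hot encode a multi-category column
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
encoded = pd.get_dummies(toy, columns=['Embarked'])
print(encoded)
# Produces one 0/1 column per category: Embarked_C, Embarked_Q, Embarked_S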
What is a Random Forest?
A Random Forest is an ensemble method that builds multiple Decision Trees and merges them together to get a more accurate and stable prediction.
Each tree in the forest is built from a random sample of the training data (with replacement) and considers a random subset of features for splitting.
This technique reduces overfitting and increases accuracy.
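To see what "a random sample with replacement" looks like in practice, here is a small sketch of one bootstrap sample, reusing the X_train and y_train from the earlier code; a Random Forest effectively repeats this for every tree it grows.
import numpy as np
rng = np.random.default_rng(42)
# Draw a bootstrap sample: same size as the training set, sampled with replacement
indices = rng.integers(0, len(X_train), size=len(X_train))
X_boot = X_train.iloc[indices]
y_boot = y_train.iloc[indices]
# Some rows appear several times, others not at all (roughly 63% of rows on average)
print("Unique rows in this bootstrap sample:", len(np.unique(indices)))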
Example: Predicting Survival with Random Forest
from sklearn.ensemble import RandomForestClassifier
# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Predict
rf_pred = rf.predict(X_test)
# Evaluate
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
Output:
Random Forest Accuracy: 0.79 # Output may vary slightly
Code Explanation
- RandomForestClassifier: An ensemble of decision trees.
- n_estimators=100: Builds 100 decision trees.
- random_state=42: Ensures reproducibility.
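A handy by-product of a trained Random Forest is its per-feature importance scores. A short sketch using the rf model fitted above (feature_importances_ is a standard scikit-learn attribute):
import pandas as pd
importances = pd.Series(rf.feature_importances_, index=['Pclass', 'Sex', 'Age'])
print(importances.sort_values(ascending=False))
# Higher values mean the feature contributed more to the forest's splits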
💠 Question: Why does Random Forest often perform better than a single Decision Tree?
🔹 Answer: Because it averages the results of many diverse trees, reducing variance and overfitting.
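One rough way to see this effect is to repeat the train/test split a few times and compare how much each model's accuracy fluctuates. A sketch reusing X, y, and the classifiers imported earlier:
import numpy as np
tree_scores, forest_scores = [], []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    tree_scores.append(DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te))
    forest_scores.append(RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr).score(X_te, y_te))
# The forest's scores are typically a bit higher and less spread out
print("Tree  :", np.round(tree_scores, 3), "std:", round(np.std(tree_scores), 3))
print("Forest:", np.round(forest_scores, 3), "std:", round(np.std(forest_scores), 3))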
Key Differences Between Decision Tree and Random Forest
| Decision Tree | Random Forest |
|---|---|
| Single model, easy to visualize | Multiple trees, complex to visualize |
| High chance of overfitting | Reduces overfitting by averaging |
| Fast training | Slightly slower due to multiple trees |
💠 Question: Should we always use Random Forest over Decision Trees?
🔹 Answer: Not always. If you need interpretability and explainability, a Decision Tree is easier. But if you want better accuracy and robustness, Random Forest is usually better.
Conclusion
Decision Trees are a great starting point for understanding supervised learning, and Random Forests build on them to produce more accurate, robust models. Together, they are essential tools in any ML engineer's toolbox.
In the next module, we’ll explore Model Improvement Techniques like cross-validation and hyperparameter tuning.