What are Model Evaluation Metrics?
In supervised learning, once we build a machine learning model, we need to evaluate how well it performs. Model evaluation metrics help us understand how accurate and reliable our predictions are, especially for classification problems.
Why can't we just use accuracy all the time?
Accuracy is a good starting point, but it may be misleading when the dataset is imbalanced. That's why we use multiple evaluation metrics to get a complete picture of model performance.
Example Scenario: Email Spam Classifier
Imagine we built a model that classifies emails as either Spam or Not Spam. Out of 100 emails, here's what our model predicted:
- Correctly identified 70 spam emails → True Positives (TP)
- Incorrectly labeled 10 legitimate emails as spam → False Positives (FP)
- Correctly identified 15 non-spam emails → True Negatives (TN)
- Missed 5 spam emails and marked them as non-spam → False Negatives (FN)
Confusion Matrix:
Actual \ Predicted | Spam | Not Spam |
---|---|---|
Spam | 70 (TP) | 5 (FN) |
Not Spam | 10 (FP) | 15 (TN) |
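To make the layout concrete, here is a minimal sketch that rebuilds the same 2×2 matrix as a labeled table (assuming pandas is available; the counts come from the scenario above):

import pandas as pd

# Counts from the spam-classifier scenario above
TP, FN = 70, 5
FP, TN = 10, 15

# Rows = actual class, columns = predicted class
cm = pd.DataFrame(
    [[TP, FN], [FP, TN]],
    index=["Actual Spam", "Actual Not Spam"],
    columns=["Predicted Spam", "Predicted Not Spam"],
)
print(cm)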
Accuracy
Accuracy = (TP + TN) / (TP + FP + FN + TN)
= (70 + 15) / (70 + 10 + 5 + 15) = 85 / 100 = 0.85 or 85%
👉 Accuracy tells us what fraction of all predictions were correct. But what if the classes were imbalanced?
❖ Question: What happens if 95 emails were Not Spam and 5 were Spam?
➤ If the model predicts all as Not Spam, accuracy = 95%. But it missed all spam! So accuracy alone is misleading.
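To see this pitfall in action, here is a minimal sketch with hypothetical data: 95 Not Spam emails, 5 Spam emails, and a lazy model that predicts Not Spam for everything:

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 Not Spam (0), 5 Spam (1)
y_true = [0]*95 + [1]*5
# A "model" that predicts Not Spam for every email
y_pred = [0]*100

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print("Recall:  ", recall_score(y_true, y_pred))    # 0.0  -- every spam email was missed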
Precision
Precision = TP / (TP + FP) = 70 / (70 + 10) = 70 / 80 = 0.875
👉 Precision tells us: "Of all emails predicted as spam, how many were actually spam?"
❖ Question: When is high precision important?
➤ In spam detection, high precision reduces the risk of flagging important emails as spam.
Recall
Recall = TP / (TP + FN) = 70 / (70 + 5) = 70 / 75 = 0.933
👉 Recall tells us: "Of all actual spam emails, how many did the model detect?"
❖ Question: When is high recall more important than precision?
➤ In medical diagnosis, missing a disease (false negative) is more dangerous than false alarms.
F1-Score
F1-score is the harmonic mean of Precision and Recall. It balances the trade-off between the two.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
= 2 * (0.875 * 0.933) / (0.875 + 0.933) ≈ 0.903
Summary Table:
Metric | Formula | Value |
---|---|---|
Accuracy | (TP + TN) / Total | 0.85 |
Precision | TP / (TP + FP) | 0.875 |
Recall | TP / (TP + FN) | 0.933 |
F1-Score | 2 * (P * R) / (P + R) | 0.903 |
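As a sanity check, the values in the table can be reproduced with plain Python arithmetic from the four counts; a minimal sketch (no libraries assumed):

TP, FP, FN, TN = 70, 10, 5, 15

accuracy  = (TP + TN) / (TP + FP + FN + TN)                 # 0.85
precision = TP / (TP + FP)                                  # 0.875
recall    = TP / (TP + FN)                                  # ~0.933
f1        = 2 * precision * recall / (precision + recall)   # ~0.903

print(accuracy, precision, round(recall, 3), round(f1, 3))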
Python Implementation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Actual and predicted values for binary classification (1 = Spam, 0 = Not Spam)
y_true = [1]*70 + [0]*10 + [0]*15 + [1]*5  # actual labels: 70 spam, 25 not spam, 5 spam
y_pred = [1]*70 + [1]*10 + [0]*15 + [0]*5  # predictions paired to give 70 TP, 10 FP, 15 TN, 5 FN
# Evaluation metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1 Score:", f1)
Output:
Confusion Matrix:
[[15 10]
 [ 5 70]]
Accuracy: 0.85
Precision: 0.875
Recall: 0.9333333333333333
F1 Score: 0.9032258064516129
Code Description
- y_true – simulated actual labels (70 spam, then 25 not spam, then 5 spam)
- y_pred – simulated predicted labels, paired with y_true so the predictions yield 70 TP, 10 FP, 15 TN, 5 FN
- confusion_matrix – shows TN, FP, FN, TP in matrix form (rows are actual classes, columns are predicted classes)
- accuracy_score – measures overall correctness
- precision_score – of the predicted positives, how many are correct
- recall_score – of the actual positives, how many were detected
- f1_score – the harmonic mean of precision and recall
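scikit-learn can also summarize precision, recall, F1, and support for both classes in a single call; a minimal sketch reusing y_true and y_pred from the code above:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support in one report
print(classification_report(y_true, y_pred, target_names=["Not Spam", "Spam"]))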
Final Thought
Always choose evaluation metrics based on the business problem. Use precision when false positives are costly. Use recall when false negatives are dangerous. Use F1-score when you want a balance between the two.