What are Model Evaluation Metrics?
In supervised learning, once we build a machine learning model, we need to evaluate how well it performs. Model evaluation metrics help us understand how accurate and reliable our predictions are, especially for classification problems.
Why can't we just use accuracy all the time?
Accuracy is a good starting point, but it may be misleading when the dataset is imbalanced. That's why we use multiple evaluation metrics to get a complete picture of model performance.
Example Scenario: Email Spam Classifier
Imagine we built a model that classifies emails as either Spam or Not Spam. Out of 100 emails, here's what our model predicted:
- Correctly identified 70 spam emails → True Positives (TP)
- Incorrectly labeled 10 legitimate emails as spam → False Positives (FP)
- Correctly identified 15 non-spam emails → True Negatives (TN)
- Missed 5 spam emails and marked them as non-spam → False Negatives (FN)
Confusion Matrix:
Actual \ Predicted | Spam | Not Spam |
---|---|---|
Spam | 70 (TP) | 5 (FN) |
Not Spam | 10 (FP) | 15 (TN) |
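To make the layout concrete, here is a minimal sketch that rebuilds the same 2×2 matrix as a labeled table (assuming pandas is available; the counts come from the scenario above):

import pandas as pd

# Counts from the spam-classifier scenario above
TP, FN = 70, 5
FP, TN = 10, 15

# Rows = actual class, columns = predicted class
cm = pd.DataFrame(
    [[TP, FN], [FP, TN]],
    index=["Actual Spam", "Actual Not Spam"],
    columns=["Predicted Spam", "Predicted Not Spam"],
)
print(cm)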
Accuracy
Accuracy = (TP + TN) / (TP + FP + FN + TN)
= (70 + 15) / (70 + 10 + 5 + 15) = 85 / 100 = 0.85 or 85%
👉 Accuracy tells us what fraction of all predictions were correct. But what if the classes were imbalanced?
❖ Question: What happens if 95 emails were Not Spam and 5 were Spam?
➤ If the model predicts all as Not Spam, accuracy = 95%. But it missed all spam! So accuracy alone is misleading.
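To see this pitfall in action, here is a minimal sketch with hypothetical data: 95 Not Spam emails, 5 Spam emails, and a lazy model that predicts Not Spam for everything:

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 Not Spam (0), 5 Spam (1)
y_true = [0]*95 + [1]*5
# A "model" that predicts Not Spam for every email
y_pred = [0]*100

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print("Recall:  ", recall_score(y_true, y_pred))    # 0.0  -- every spam email was missed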
Precision
Precision = TP / (TP + FP) = 70 / (70 + 10) = 70 / 80 = 0.875
👉 Precision tells us: "Of all emails predicted as spam, how many were actually spam?"
❖ Question: When is high precision important?
➤ In spam detection, high precision reduces the risk of flagging important emails as spam.
Recall
Recall = TP / (TP + FN) = 70 / (70 + 5) = 70 / 75 = 0.933
👉 Recall tells us: "Of all actual spam emails, how many did the model detect?"
❖ Question: When is high recall more important than precision?
➤ In medical diagnosis, missing a disease (false negative) is more dangerous than false alarms.
F1-Score
F1-score is the harmonic mean of Precision and Recall. It balances the trade-off between the two.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
= 2 * (0.875 * 0.933) / (0.875 + 0.933) ≈ 0.903
Summary Table:
Metric | Formula | Value |
---|---|---|
Accuracy | (TP + TN) / Total | 0.85 |
Precision | TP / (TP + FP) | 0.875 |
Recall | TP / (TP + FN) | 0.933 |
F1-Score | 2 * (P * R) / (P + R) | 0.903 |
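As a sanity check, the values in the table can be reproduced with plain Python arithmetic from the four counts; a minimal sketch (no libraries assumed):

TP, FP, FN, TN = 70, 10, 5, 15

accuracy  = (TP + TN) / (TP + FP + FN + TN)                 # 0.85
precision = TP / (TP + FP)                                  # 0.875
recall    = TP / (TP + FN)                                  # ~0.933
f1        = 2 * precision * recall / (precision + recall)   # ~0.903

print(accuracy, precision, round(recall, 3), round(f1, 3))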
Python Implementation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Actual and predicted values for binary classification (1 = Spam, 0 = Not Spam)
y_true = [1]*70 + [0]*10 + [0]*15 + [1]*5  # actual labels: 70 spam, 25 not spam, 5 spam
y_pred = [1]*70 + [1]*10 + [0]*15 + [0]*5  # predictions paired to give 70 TP, 10 FP, 15 TN, 5 FN
# Evaluation metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1 Score:", f1)
Output:
Confusion Matrix:
[[15 10]
 [ 5 70]]
Accuracy: 0.85
Precision: 0.875
Recall: 0.9333333333333333
F1 Score: 0.9032258064516129
Code Description
- y_true – simulated actual labels (70 spam, then 25 not spam, then 5 spam)
- y_pred – simulated predicted labels, paired with y_true so the predictions yield 70 TP, 10 FP, 15 TN, 5 FN
- confusion_matrix – shows TN, FP, FN, TP in matrix form (rows are actual classes, columns are predicted classes)
- accuracy_score – measures overall correctness
- precision_score – of the predicted positives, how many are correct
- recall_score – of the actual positives, how many were detected
- f1_score – the harmonic mean of precision and recall
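scikit-learn can also summarize precision, recall, F1, and support for both classes in a single call; a minimal sketch reusing y_true and y_pred from the code above:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support in one report
print(classification_report(y_true, y_pred, target_names=["Not Spam", "Spam"]))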
Final Thought
Always choose evaluation metrics based on the business problem. Use precision when false positives are costly. Use recall when false negatives are dangerous. Use F1-score when you want a balance between the two.