
Model Evaluation Metrics in Machine Learning (with Examples & Python Code)



What are Model Evaluation Metrics?

In supervised learning, once we build a machine learning model, we need to evaluate how well it performs. Model evaluation metrics help us understand how accurate and reliable our predictions are, especially for classification problems.

Why can't we just use accuracy all the time?

Accuracy is a good starting point, but it may be misleading when the dataset is imbalanced. That's why we use multiple evaluation metrics to get a complete picture of model performance.

Example Scenario: Email Spam Classifier

Imagine we built a model that classifies emails as either Spam or Not Spam. Out of 100 emails, here's what our model predicted:

Confusion Matrix:

                     Predicted
                     Spam    Not Spam
Actual Spam            70           5
Actual Not Spam        10          15

Accuracy

Accuracy = (TP + TN) / (TP + FP + FN + TN)

= (70 + 15) / (70 + 10 + 5 + 15) = 85 / 100 = 0.85 or 85%
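
To make the arithmetic concrete, here is a minimal Python sketch that plugs the four counts from the confusion matrix above into the formula (the TP/FP/FN/TN variables are just local names for this example):

# Counts from the confusion matrix above
TP, FN = 70, 5    # actual Spam: correctly flagged / missed
FP, TN = 10, 15   # actual Not Spam: wrongly flagged / correctly passed

accuracy = (TP + TN) / (TP + FP + FN + TN)
print(accuracy)  # 0.85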

👉 Accuracy tells us how many total predictions were correct. But what if the classes were imbalanced?

❖ Question: What happens if 95 emails were Not Spam and 5 were Spam?

➤ If the model predicts every email as Not Spam, accuracy = 95%, yet it misses all 5 spam emails. So accuracy alone is misleading; the sketch below makes this concrete.
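
A minimal sketch of that imbalanced scenario, using sklearn's accuracy_score and recall_score (the 95/5 split is the hypothetical from the question above):

from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 Not Spam (0), 5 Spam (1)
y_true = [0] * 95 + [1] * 5
# A naive model that predicts Not Spam for every email
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- every spam email was missed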

Precision

Precision = TP / (TP + FP) = 70 / (70 + 10) = 70 / 80 = 0.875

👉 Precision tells us: "Of all emails predicted as spam, how many were actually spam?"

❖ Question: When is high precision important?

➤ In spam detection, high precision reduces the risk of flagging important emails as spam.
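
The same confusion-matrix counts give precision directly; a quick sketch:

TP, FP = 70, 10
precision = TP / (TP + FP)
print(precision)  # 0.875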

Recall

Recall = TP / (TP + FN) = 70 / (70 + 5) = 70 / 75 = 0.933

👉 Recall tells us: "Of all actual spam emails, how many did the model detect?"

❖ Question: When is high recall more important than precision?

➤ In medical diagnosis, missing a disease (false negative) is more dangerous than false alarms.
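
And the same counts give recall; another quick sketch:

TP, FN = 70, 5
recall = TP / (TP + FN)
print(round(recall, 3))  # 0.933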

F1-Score

F1-score is the harmonic mean of Precision and Recall. It balances the trade-off between the two.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

= 2 * (0.875 * 0.933) / (0.875 + 0.933) ≈ 0.903
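
Verifying the harmonic mean in code, reusing the precision and recall values computed above:

precision = 70 / 80   # 0.875
recall = 70 / 75      # 0.9333...
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.903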

Summary Table:

Metric      Formula                   Value
Accuracy    (TP + TN) / Total         0.85
Precision   TP / (TP + FP)            0.875
Recall      TP / (TP + FN)            0.933
F1-Score    2 * (P * R) / (P + R)     0.903

Python Implementation

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Actual and predicted values for binary classification (1 = Spam, 0 = Not Spam)
y_true = [1]*70 + [0]*10 + [0]*15 + [1]*5  # 70 TP, 10 FP, 15 TN, 5 FN
y_pred = [1]*70 + [1]*10 + [0]*15 + [0]*5  # matching predictions

# Evaluation metrics
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print("Confusion Matrix:")
print(cm)
print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1 Score:", f1)

Output:

Confusion Matrix:
[[15 10]
 [ 5 70]]
Accuracy: 0.85
Precision: 0.875
Recall: 0.9333333333333333
F1 Score: 0.9032258064516129

Code Description

The code recreates the same 100-email scenario as the confusion matrix above: y_true holds the actual labels and y_pred the model's predictions, ordered so that there are 70 true positives, 10 false positives, 15 true negatives, and 5 false negatives. The sklearn.metrics functions then compute each metric directly from these two lists. Note that confusion_matrix orders rows and columns by label value, so row 0 is actual Not Spam and row 1 is actual Spam, which is why the printed matrix reads [[TN FP], [FN TP]].

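If you want all of these metrics at once, scikit-learn's classification_report summarizes precision, recall, and F1 per class in a single call; a short sketch reusing the same y_true and y_pred lists:

from sklearn.metrics import classification_report

y_true = [1]*70 + [0]*10 + [0]*15 + [1]*5
y_pred = [1]*70 + [1]*10 + [0]*15 + [0]*5

# target_names follows the sorted label order (0 = Not Spam, 1 = Spam)
print(classification_report(y_true, y_pred, target_names=["Not Spam", "Spam"]))
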
Final Thought

Always choose evaluation metrics based on the business problem. Use precision when false positives are costly. Use recall when false negatives are dangerous. Use F1-score when you want a balance between the two.


