Cross-Validation Techniques
When we train a machine learning model, we typically split the data into training and test sets. But there's a challenge: a single split gives us only one estimate of performance, so how do we know if the model will generalize well to unseen data? This is where Cross-Validation comes in.
Why Use Cross-Validation?
Cross-validation helps us:
- Check the stability and reliability of the model.
- Use the data more efficiently.
- Reduce the risk of overfitting or underfitting.
🔹 What is Cross-Validation?
It's a technique for evaluating a model by training it on different subsets of the data and validating it on the remaining parts. The idea is to rotate the training and testing portions to test the model's robustness.
1. K-Fold Cross-Validation
In K-Fold Cross-Validation:
- The dataset is split into K equal parts (folds).
- The model trains on K-1 folds and tests on the remaining one.
- This process repeats K times, each time using a different fold for testing.
- Averaging the K evaluation scores gives a more stable measure of model performance than any single split.
🟢 Example: K-Fold with 5 Splits
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)
scores = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    scores.append(acc)
print("Fold Accuracies:", scores)
print("Average Accuracy:", np.mean(scores))
Fold Accuracies: [0.9667, 1.0, 0.9, 0.9333, 1.0]
Average Accuracy: 0.96
Code Explanation:
- KFold(n_splits=5) splits the data into 5 folds.
- model.fit() trains the logistic regression model on the training folds.
- accuracy_score() calculates how well the model performed on each test fold.
- The average accuracy gives us a more reliable estimate than a single train-test split.
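If you don't need per-fold control, scikit-learn's cross_val_score runs the same split-train-score loop in a single call. A minimal sketch, reusing the model, data, and kf splitter from above (passing the same splitter reproduces the manual loop's scores):

from sklearn.model_selection import cross_val_score

# One call runs the whole fit/score loop over the 5 folds
cv_scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print("Fold Accuracies:", cv_scores)
print("Average Accuracy:", cv_scores.mean())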
Why shuffle the data before splitting?
✔️ Because data might have patterns (e.g., sorted by class). Shuffling ensures randomness and avoids biased splits.
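To see the risk concretely, recall that the iris targets are stored sorted by class (50 samples each of classes 0, 1, and 2). A small sketch, assuming the X and y arrays from the example above, that prints the class counts in each test fold with and without shuffling:

from sklearn.model_selection import KFold
import numpy as np

for shuffle in (False, True):
    # random_state only applies when shuffle=True
    kf_demo = KFold(n_splits=5, shuffle=shuffle, random_state=42 if shuffle else None)
    counts = [np.bincount(y[test_idx], minlength=3).tolist()
              for _, test_idx in kf_demo.split(X)]
    print("shuffle =", shuffle, "->", counts)

Without shuffling, several test folds contain only a single class, so the splits are biased.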
2. Stratified K-Fold Cross-Validation
Stratified K-Fold ensures that each fold has approximately the same class distribution as the original dataset. This is especially useful when the dataset is imbalanced.
🟢 Example: Stratified K-Fold
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)
stratified_scores = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    stratified_scores.append(acc)
print("Stratified Fold Accuracies:", stratified_scores)
print("Average Accuracy:", np.mean(stratified_scores))
Stratified Fold Accuracies: [1.0, 0.9667, 0.9333, 0.9, 1.0]
Average Accuracy: 0.96
When should you use Stratified K-Fold instead of regular K-Fold?
✔️ When your dataset has imbalanced classes. It preserves class ratio in each fold.
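You can verify the stratification directly by counting the classes in each test fold. A quick sketch reusing the skf splitter, X, and y from the example above:

import numpy as np

# Each iris class has 50 samples, so every test fold of 30 should hold 10 per class
for fold, (_, test_idx) in enumerate(skf.split(X, y), start=1):
    print("Fold", fold, "test class counts:", np.bincount(y[test_idx]))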
3. Leave-One-Out Cross-Validation (LOOCV)
In LOOCV, we leave out a single sample for testing and train on all remaining samples. This is repeated once for every data point.
🟢 Example: LOOCV
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
model = LogisticRegression(max_iter=200)
loo_scores = []
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    acc = accuracy_score(y_test, prediction)
    loo_scores.append(acc)
print("LOOCV Accuracy:", np.mean(loo_scores))
LOOCV Accuracy: 0.9533
Code Explanation:
- LeaveOneOut() splits the dataset so that each sample is used for testing exactly once.
- Results can vary widely across folds, and the final estimate can have high variance if the dataset is small or noisy.
- It is exhaustive but yields a nearly unbiased estimate of generalization error.
Should I always use LOOCV for best accuracy?
✖️ Not always. It's computationally expensive, especially for large datasets. Use it when data is limited and computation time is not a concern.
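To get a feel for that cost: LOOCV fits one model per sample, so even small datasets multiply the training time. A short sketch, reusing loo, model, X, and y from the example above:

from sklearn.model_selection import cross_val_score

# One fit per sample: 150 fits for the 150-sample iris dataset
print("Number of fits required:", loo.get_n_splits(X))

# The same evaluation in a single call (still 150 fits under the hood)
loo_cv = cross_val_score(model, X, y, cv=loo)
print("LOOCV Accuracy:", loo_cv.mean())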
Conclusion
Cross-validation is essential for evaluating how well your model performs on unseen data. While K-Fold is the most commonly used, Stratified K-Fold is better for classification with class imbalance, and LOOCV is great when data is scarce.
👉 Choose your technique based on your data size, class balance, and available computational power.
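As a closing sketch, all three strategies plug into the same cross_val_score interface, so comparing them on your own model and data takes only a few lines (the splitter settings here mirror the examples above):

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut

strategies = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "LOOCV": LeaveOneOut(),
}
for name, cv in strategies.items():
    # Each splitter is passed straight to cross_val_score via cv=
    print(name, "->", round(cross_val_score(model, X, y, cv=cv).mean(), 4))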