When we train a machine learning model, we typically split the data into training and test sets. But there's a challenge: how do we know if the model will generalize well to unseen data? This is where Cross-Validation comes in.
Cross-validation helps us estimate how reliably a model will perform on data it has never seen, using only the data we already have.
It's a technique for evaluating a model by training it on different subsets of the data and validating it on the remaining parts. The idea is to rotate the training and testing portions to test the model's robustness.
In K-Fold Cross-Validation, the dataset is split into K equal folds. The model is trained on K-1 folds and tested on the remaining fold, and this is repeated K times so that every fold serves as the test set exactly once.
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Set up 5-fold cross-validation with shuffled splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)
scores = []

for train_index, test_index in kf.split(X):
    # Split the data into training and test folds
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train on the training folds and evaluate on the held-out fold
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    scores.append(acc)

print("Fold Accuracies:", scores)
print("Average Accuracy:", np.mean(scores))
Fold Accuracies: [0.9667, 1.0, 0.9, 0.9333, 1.0]
Average Accuracy: 0.96
- KFold(n_splits=5) splits the data into 5 folds.
- model.fit() trains the logistic regression model.
- accuracy_score() calculates how well the model performed on each fold.

Why shuffle the data? ✔️ Because data might have patterns (e.g., rows sorted by class). Shuffling ensures randomness and avoids biased splits.
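If you'd rather not write the loop by hand, scikit-learn's cross_val_score helper performs the same K-Fold evaluation in a single call. A minimal sketch, reusing the model, kf, X, and y defined above (scoring defaults to accuracy for classifiers):

from sklearn.model_selection import cross_val_score

# Equivalent 5-fold evaluation in one call
cv_scores = cross_val_score(model, X, y, cv=kf)
print("Fold Accuracies:", cv_scores)
print("Average Accuracy:", cv_scores.mean())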
Stratified K-Fold ensures that each fold has the same class distribution as the original dataset. This is especially useful when the dataset is imbalanced.
from sklearn.model_selection import StratifiedKFold
# Stratified splits preserve the class distribution in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)
stratified_scores = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    stratified_scores.append(acc)

print("Stratified Fold Accuracies:", stratified_scores)
print("Average Accuracy:", np.mean(stratified_scores))
Stratified Fold Accuracies: [1.0, 0.9667, 0.9333, 0.9, 1.0]
Average Accuracy: 0.96
When should you use Stratified K-Fold? ✔️ When your dataset has imbalanced classes. It preserves the class ratio in each fold.
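To see the effect of stratification directly, you can count how many samples of each class land in every test fold. A small sketch, reusing skf, X, y, and np from above (np.bincount returns per-class counts):

# Each iris class has 50 samples, so a stratified 5-fold test set should contain 10 of each
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold} test-set class counts:", np.bincount(y[test_index]))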
In LOOCV, we leave only one sample for testing and use all other samples for training. This is repeated for every data point.
from sklearn.model_selection import LeaveOneOut
# Leave-One-Out: each sample becomes the test set exactly once
loo = LeaveOneOut()
model = LogisticRegression(max_iter=200)
loo_scores = []

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    acc = accuracy_score(y_test, prediction)
    loo_scores.append(acc)

print("LOOCV Accuracy:", np.mean(loo_scores))
LOOCV Accuracy: 0.9533
LeaveOneOut() splits the dataset such that each sample is tested exactly once.

Should you always use LOOCV? ✖️ Not always. It's computationally expensive, especially for large datasets. Use it when data is limited and computation time is not a concern.
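One way to gauge that cost up front is to ask the splitter how many train/test splits (and therefore model fits) it will produce. A quick sketch, reusing loo and X from above:

# LOOCV trains one model per sample: 150 fits for the 150-sample iris dataset
print("Number of LOOCV fits:", loo.get_n_splits(X))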
Cross-validation is essential for evaluating how well your model performs on unseen data. While K-Fold is the most commonly used, Stratified K-Fold is better for classification with class imbalance, and LOOCV is great when data is scarce.
👉 Choose your technique based on your data size, class balance, and available computational power.
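As a closing illustration, the three strategies can be compared side by side with cross_val_score, since its cv parameter accepts any of the splitter objects used above. A sketch, reusing X, y, and the imports from earlier:

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut

# Evaluate the same model under each cross-validation strategy
strategies = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "LOOCV": LeaveOneOut(),
}
for name, cv in strategies.items():
    scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=cv)
    print(f"{name} average accuracy: {scores.mean():.4f}")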