
Cross-Validation Techniques in Machine Learning (With Examples)



Cross-Validation Techniques

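When we train a machine learning model, we typically split the data into training and test sets. But a single split yields only one performance estimate, and that estimate depends on which samples happen to land in the test set. So how do we know whether the model will generalize well to unseen data? This is where cross-validation comes in.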

Why Use Cross-Validation?

Cross-validation helps us:

  • Check the stability and reliability of the model.
  • Use the data more efficiently.
  • Detect overfitting or underfitting that a single train-test split might hide.

🔹 What is Cross-Validation?

It's a technique for evaluating a model by training it on different subsets of the data and validating it on the remaining parts. The idea is to rotate the training and testing portions to test the model's robustness.


1. K-Fold Cross-Validation

In K-Fold Cross-Validation:

  • The dataset is split into K roughly equal parts (folds).
  • The model trains on K-1 folds and tests on the remaining one.
  • This process repeats K times, each time using a different fold for testing.
  • The average of the K evaluation scores is a more stable estimate of model performance than the score from any single split.

🟢 Example: K-Fold with 5 Splits

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)

scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    scores.append(acc)

print("Fold Accuracies:", scores)
print("Average Accuracy:", np.mean(scores))

Output (rounded):

Fold Accuracies: [0.9667, 1.0, 0.9, 0.9333, 1.0]
Average Accuracy: 0.96

Code Explanation:

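  • KFold(n_splits=5, shuffle=True, random_state=42) splits the data into 5 folds; random_state makes the shuffle reproducible.
  • model.fit() retrains the logistic regression model from scratch on each fold's training data.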
  • accuracy_score() calculates how well the model performed on each fold.
  • The average accuracy gives us a more reliable estimate than a single train-test split.
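
The manual loop above can usually be condensed with scikit-learn's cross_val_score helper. The following is a minimal sketch under the same setup (Iris, logistic regression, 5 shuffled folds); reporting the standard deviation alongside the mean is an addition here, not part of the original example, and gives a rough sense of how much the folds disagree.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)

# cross_val_score clones the model, fits it on each training split,
# and returns one score per fold (accuracy here).
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy: %.4f" % scores.mean())
print("Std of fold accuracies: %.4f" % scores.std())
```

A large standard deviation relative to the mean suggests the estimate is sensitive to how the data was split.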

Why shuffle the data before splitting?

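✔️ Because the data might be ordered (e.g., sorted by class). Iris is stored this way, so without shuffling each of the 5 test folds would contain only one or two of the three classes. Shuffling randomizes fold membership and avoids such biased splits.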
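
To see the effect of ordering, the sketch below lists which classes appear in each test fold of Iris when shuffling is turned off, then repeats the check with shuffling on. It is an illustration only; the variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)  # rows are stored sorted by class

# Without shuffling, each fold is a consecutive block of 30 rows.
unshuffled = KFold(n_splits=5, shuffle=False)
unshuffled_classes = [np.unique(y[test_idx]).tolist()
                      for _, test_idx in unshuffled.split(X)]
print("Unshuffled:", unshuffled_classes)
# Unshuffled: [[0], [0, 1], [1], [1, 2], [2]]

# With shuffling, fold membership is random, so classes mix.
shuffled = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_classes = [np.unique(y[test_idx]).tolist()
                    for _, test_idx in shuffled.split(X)]
print("Shuffled:", shuffled_classes)
```

In the unshuffled case, a model is sometimes tested on a class it barely saw (or never saw) during training, which makes the fold scores misleading.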


2. Stratified K-Fold Cross-Validation

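Stratified K-Fold ensures that each fold has approximately the same class distribution as the original dataset. This is especially useful when the dataset is imbalanced, because a plain random split can leave a fold with very few or no samples of a rare class.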

🟢 Example: Stratified K-Fold

# Reuses X, y, LogisticRegression, accuracy_score, and np from the K-Fold example
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)

stratified_scores = []

# y is passed to split() so the class labels can be balanced across folds
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    stratified_scores.append(acc)

print("Stratified Fold Accuracies:", stratified_scores)
print("Average Accuracy:", np.mean(stratified_scores))

Output (rounded):

Stratified Fold Accuracies: [1.0, 0.9667, 0.9333, 0.9, 1.0]
Average Accuracy: 0.96

When should you use Stratified K-Fold instead of regular K-Fold?

✔️ When your dataset has imbalanced classes. It preserves the class ratio in each fold. It applies to classification only, since it needs class labels to stratify on.
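
Iris is balanced, so it does not show the difference well. The sketch below uses a hypothetical toy label vector (90 negatives, 10 positives, invented for this illustration) and counts how many positives land in each test fold under both splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # feature values do not affect the splitting

def positives_per_fold(splitter):
    """Count the positive samples in each test fold."""
    return [int(y_toy[test_idx].sum())
            for _, test_idx in splitter.split(X_toy, y_toy)]

kfold_counts = positives_per_fold(
    KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold positives per fold:          ", kfold_counts)
print("StratifiedKFold positives per fold:", strat_counts)
# StratifiedKFold puts 2 of the 10 positives in every fold;
# plain KFold counts depend on the random shuffle.
```

With plain K-Fold, a fold can end up with far fewer positives than expected, which makes metrics for the rare class unstable from fold to fold.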


3. Leave-One-Out Cross-Validation (LOOCV)

In LOOCV, we leave only one sample for testing and use all other samples for training. This is repeated for every data point, which makes it equivalent to K-Fold with K equal to the number of samples.

🟢 Example: LOOCV

# Reuses X, y, LogisticRegression, accuracy_score, and np from the K-Fold example
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
model = LogisticRegression(max_iter=200)

loo_scores = []

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    acc = accuracy_score(y_test, prediction)
    loo_scores.append(acc)

print("LOOCV Accuracy:", np.mean(loo_scores))

Output (rounded):

LOOCV Accuracy: 0.9533

Code Explanation:

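  • LeaveOneOut() splits the dataset such that each sample is tested exactly once, so a dataset of n samples requires n model fits (150 for Iris).
  • Each per-sample score is either 0 or 1, so only the mean is meaningful; the estimate can have high variance if the dataset is small or noisy.
  • This is exhaustive and trains on nearly all the data in every round, which keeps the bias of the estimate low, but it does not guarantee the best estimate of generalization.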
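
As a rough illustration of these points, the sketch below runs LOOCV through cross_val_score and prints the number of model fits and the distinct per-sample scores. It assumes the same Iris setup as the earlier examples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one split (and one model fit) per sample
print("Number of model fits:", n_fits)

# Each test set holds a single sample, so every score is 0.0 or 1.0.
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=loo)
print("Distinct per-sample scores:", np.unique(scores))
print("LOOCV accuracy: %.4f" % scores.mean())
```

The fit count grows linearly with the dataset size, which is why LOOCV becomes impractical for large datasets or slow models.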

Should I always use LOOCV for best accuracy?

✖️ Not always. LOOCV estimates accuracy rather than improving it, and it is computationally expensive because it needs one model fit per sample. Use it when data is limited and computation time is not a concern.


Conclusion

Cross-validation is essential for evaluating how well your model performs on unseen data. While K-Fold is the most commonly used, Stratified K-Fold is better for classification with class imbalance, and LOOCV is great when data is scarce.

👉 Choose your technique based on your data size, class balance, and available computational power.


