When we train a machine learning model, we typically split the data into training and test sets. But there's a challenge: how do we know if the model will generalize well to unseen data? This is where Cross-Validation comes in.
Cross-validation helps us estimate how reliably a model will perform on data it has never seen, using only the data we already have.
It's a technique for evaluating a model by training it on different subsets of the data and validating it on the remaining parts. The idea is to rotate the training and testing portions to test the model's robustness.
In K-Fold Cross-Validation, the dataset is split into K equal folds. The model is trained on K-1 folds and tested on the remaining fold, and this is repeated K times so that every fold serves as the test set exactly once.
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
import numpy as np
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Set up 5-fold cross-validation with shuffled splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)
scores = []

for train_index, test_index in kf.split(X):
    # Split the data into training and test folds
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train on the training folds and evaluate on the held-out fold
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    scores.append(acc)

print("Fold Accuracies:", scores)
print("Average Accuracy:", np.mean(scores))
Fold Accuracies: [0.9667, 1.0, 0.9, 0.9333, 1.0]
Average Accuracy: 0.96
- KFold(n_splits=5) splits the data into 5 folds.
- model.fit() trains the logistic regression model.
- accuracy_score() calculates how well the model performed on each fold.

Why shuffle the data? ✔️ Because data might have patterns (e.g., rows sorted by class). Shuffling ensures randomness and avoids biased splits.
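If you'd rather not write the loop by hand, scikit-learn's cross_val_score helper performs the same K-Fold evaluation in a single call. A minimal sketch, reusing the model, kf, X, and y defined above (scoring defaults to accuracy for classifiers):

from sklearn.model_selection import cross_val_score

# Equivalent 5-fold evaluation in one call
cv_scores = cross_val_score(model, X, y, cv=kf)
print("Fold Accuracies:", cv_scores)
print("Average Accuracy:", cv_scores.mean())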
Stratified K-Fold ensures that each fold has the same class distribution as the original dataset. This is especially useful when the dataset is imbalanced.
from sklearn.model_selection import StratifiedKFold
# Stratified splits preserve the class distribution in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=200)
stratified_scores = []

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    stratified_scores.append(acc)

print("Stratified Fold Accuracies:", stratified_scores)
print("Average Accuracy:", np.mean(stratified_scores))
Stratified Fold Accuracies: [1.0, 0.9667, 0.9333, 0.9, 1.0]
Average Accuracy: 0.96
When should you use Stratified K-Fold? ✔️ When your dataset has imbalanced classes. It preserves the class ratio in each fold.
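To see the effect of stratification directly, you can count how many samples of each class land in every test fold. A small sketch, reusing skf, X, y, and np from above (np.bincount returns per-class counts):

# Each iris class has 50 samples, so a stratified 5-fold test set should contain 10 of each
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold} test-set class counts:", np.bincount(y[test_index]))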
In LOOCV, we leave only one sample for testing and use all other samples for training. This is repeated for every data point.
from sklearn.model_selection import LeaveOneOut
# Leave-One-Out: each sample becomes the test set exactly once
loo = LeaveOneOut()
model = LogisticRegression(max_iter=200)
loo_scores = []

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    acc = accuracy_score(y_test, prediction)
    loo_scores.append(acc)

print("LOOCV Accuracy:", np.mean(loo_scores))
LOOCV Accuracy: 0.9533
LeaveOneOut() splits the dataset such that each sample is tested exactly once.

Should you always use LOOCV? ✖️ Not always. It's computationally expensive, especially for large datasets. Use it when data is limited and computation time is not a concern.
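One way to gauge that cost up front is to ask the splitter how many train/test splits (and therefore model fits) it will produce. A quick sketch, reusing loo and X from above:

# LOOCV trains one model per sample: 150 fits for the 150-sample iris dataset
print("Number of LOOCV fits:", loo.get_n_splits(X))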
Cross-validation is essential for evaluating how well your model performs on unseen data. While K-Fold is the most commonly used, Stratified K-Fold is better for classification with class imbalance, and LOOCV is great when data is scarce.
👉 Choose your technique based on your data size, class balance, and available computational power.
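As a closing illustration, the three strategies can be compared side by side with cross_val_score, since its cv parameter accepts any of the splitter objects used above. A sketch, reusing X, y, and the imports from earlier:

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut

# Evaluate the same model under each cross-validation strategy
strategies = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "LOOCV": LeaveOneOut(),
}
for name, cv in strategies.items():
    scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=cv)
    print(f"{name} average accuracy: {scores.mean():.4f}")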