What is Hyperparameter Tuning?
In machine learning, a hyperparameter is a configuration that is set before the training process begins. These are not learned from the data but control the learning process itself. Examples include:
- Number of neighbors in KNeighborsClassifier
- Maximum depth of a decision tree
- Learning rate in gradient boosting models
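For example, a hyperparameter such as max_depth is chosen before fit() is called, while quantities such as feature importances are learned from the data during training. A minimal sketch of that distinction:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
# max_depth is a hyperparameter: we choose it before training.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)
# feature_importances_ are learned from the data during fit().
print(tree.feature_importances_)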
Hyperparameter tuning is the process of choosing the best combination of these settings to improve the performance of a model.
🔸 Why can't we just use default hyperparameters?
Default values may work fine, but they’re generic. Tuning helps you squeeze more accuracy from your model for your specific dataset.
🔹 Example 1: Tuning KNeighborsClassifier using GridSearchCV
Let’s tune the number of neighbors (n_neighbors) in a K-Nearest Neighbors classifier.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# KNN with GridSearch
param_grid = {
'n_neighbors': [1, 3, 5, 7, 9]
}
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
Best parameters: {'n_neighbors': 3}
Best score: 0.9619
Explanation
- GridSearchCV tries every combination of the parameters you define.
- cv=5 means 5-fold cross-validation: the training data is split into 5 parts, and each parameter combination is evaluated across those folds.
- best_params_ gives you the best value of n_neighbors.
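If you want to see how every candidate performed, not just the winner, the fitted search object exposes a cv_results_ dictionary. A minimal sketch, assuming grid_search has already been fitted as above:
import pandas as pd
# Each row is one parameter combination; mean_test_score is the
# average accuracy across the 5 cross-validation folds.
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']])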
✦ Question: Why do we use cross-validation instead of testing on test data directly?
✧ Answer: Because test data should be untouched until final evaluation. Cross-validation ensures the model generalizes well before we use the test set.
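As a standalone illustration, cross_val_score evaluates a single configuration on the training data only; a minimal sketch reusing X_train and y_train from the split above:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# 5-fold cross-validation on the training data only;
# the test set stays untouched for the final evaluation.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_train, y_train, cv=5)
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())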
🔹 Example 2: Tuning a Decision Tree using multiple hyperparameters
In a decision tree, two important hyperparameters are max_depth and min_samples_split.
from sklearn.tree import DecisionTreeClassifier
param_grid = {
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best score: 0.9428
Explanation
- max_depth controls how deep the tree can go. Deeper trees may overfit.
- min_samples_split controls the minimum number of samples required to split a node.
- The best combination is selected based on cross-validation performance.
✦ Question: What happens if we don’t restrict the depth of a decision tree?
✧ Answer: The tree will grow deep and might overfit the training data, performing poorly on unseen data.
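To see this effect, you can compare an unrestricted tree with a depth-limited one on the same split. A minimal sketch, reusing the train/test split from above (exact scores will vary with the dataset; the telltale sign of overfitting is a large gap between training and test accuracy):
from sklearn.tree import DecisionTreeClassifier
for depth in [None, 3]:
    # max_depth=None lets the tree grow until every leaf is pure.
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train accuracy={tree.score(X_train, y_train):.3f}, "
          f"test accuracy={tree.score(X_test, y_test):.3f}")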
🔹 RandomizedSearchCV vs GridSearchCV
GridSearchCV tries all combinations exhaustively. This is great for small search spaces but becomes slow with many parameters.
RandomizedSearchCV tries only a fixed number of random combinations, making it faster.
🔸 Use Case: RandomizedSearchCV for Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
param_dist = {
'n_estimators': [50, 100, 150, 200],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10]
}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist, n_iter=5, cv=3, random_state=42)
random_search.fit(X_train, y_train)
print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)
Best parameters: {'n_estimators': 200, 'min_samples_split': 2, 'max_depth': 30}
Best score: 0.9619
Explanation
- n_iter=5 means only 5 random combinations are tried.
- Faster for large search spaces, with only a slight risk of missing the exact best combination.
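RandomizedSearchCV also accepts distributions instead of fixed lists, which is useful when a parameter has many plausible values. A minimal sketch using scipy.stats.randint (chosen here purely for illustration; any distribution with an rvs method works):
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
# Integer-valued hyperparameters are sampled from these ranges
# instead of being enumerated as explicit lists.
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 30),
    'min_samples_split': randint(2, 11)
}
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                   param_distributions=param_dist,
                                   n_iter=10, cv=3, random_state=42)
random_search.fit(X_train, y_train)
print("Best parameters:", random_search.best_params_)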
✦ Question: Should we always use RandomizedSearch for big models?
✧ Answer: Usually, yes. When there are many hyperparameters, it is efficient and often reaches near-optimal results much faster than an exhaustive grid search.
Final Tips for Hyperparameter Tuning
- Start with Grid Search for small models like KNN or Decision Trees
- Use Randomized Search for complex models like Random Forest, XGBoost, etc.
- Always combine with cross-validation to avoid overfitting
- After tuning, evaluate the best model on the unseen test set, as sketched below
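A minimal sketch of that last step, assuming the grid_search object from Example 1 and the train/test split defined earlier:
from sklearn.metrics import accuracy_score
# best_estimator_ is the model refitted on the full training set
# with the best parameters found during the search.
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))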
Summary
Hyperparameter tuning is a critical step to improve your machine learning model’s performance. It helps find the optimal settings that generalize well on new data.
Mastering tools like GridSearchCV and RandomizedSearchCV will make your ML workflow robust and production-ready.