
Machine Learning for Beginners

Hyperparameter Tuning in Machine Learning



What is Hyperparameter Tuning?

In machine learning, a hyperparameter is a configuration that is set before the training process begins. These are not learned from the data but control the learning process itself. Examples include:

  • Number of neighbors in KNeighborsClassifier
  • Maximum depth of a decision tree
  • Learning rate in gradient boosting models

Hyperparameter tuning is the process of choosing the best combination of these settings to improve the performance of a model.
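To make the distinction concrete, here is a minimal sketch: hyperparameters are passed in when the model is constructed, while the model's internal parameters are learned later by fit().

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are set by us, before the model sees any data
knn = KNeighborsClassifier(n_neighbors=5)    # number of neighbors
tree = DecisionTreeClassifier(max_depth=3)   # maximum tree depth

# Model parameters (e.g. the tree's split thresholds) are only
# learned from the data once fit(X, y) is called.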

🔸 Why can't we just use default hyperparameters?

Default values may work fine, but they’re generic. Tuning helps you squeeze more accuracy from your model for your specific dataset.

🔹 Example 1: Tuning KNeighborsClassifier using GridSearchCV

Let’s tune the number of neighbors (n_neighbors) in a K-Nearest Neighbors classifier.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN with GridSearch
param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9]
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

Output:
Best parameters: {'n_neighbors': 3}
Best score: 0.9619

Explanation

  • GridSearchCV tries every combination of the parameters you define.
  • cv=5 means 5-fold cross-validation: the training data is split into 5 folds, and every parameter combination is scored on each of them.
  • best_params_ gives you the best value of n_neighbors; the score of every candidate is stored in cv_results_ (see the sketch below).
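
To see what "tries every combination" means in practice, here is a minimal sketch, continuing the example above, that prints the mean cross-validated score for each candidate value:

# cv_results_ holds per-candidate details; each entry in
# 'mean_test_score' is the average accuracy across the 5 folds
for params, mean_score in zip(grid_search.cv_results_['params'],
                              grid_search.cv_results_['mean_test_score']):
    print(params, round(mean_score, 4))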

✦ Question: Why do we use cross-validation instead of testing on test data directly?

Answer: Because the test data should stay untouched until the final evaluation. Cross-validation estimates how well the model generalizes using only the training data, keeping the test set as an honest final benchmark.
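
Here is a minimal sketch of what happens under the hood for a single candidate, using cross_val_score on the training split only; notice that the test set never appears:

from sklearn.model_selection import cross_val_score

# Score one candidate (n_neighbors=3) with 5-fold CV,
# reusing X_train/y_train from the split above
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                         X_train, y_train, cv=5)
print(scores.mean())  # X_test/y_test stay untouched for the final evaluation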


🔹 Example 2: Tuning Decision Tree using multiple hyperparameters

In a decision tree, some important hyperparameters are max_depth and min_samples_split.

from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

Output:
Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best score: 0.9428

Explanation

  • max_depth controls how deep the tree can go. Deeper trees may overfit.
  • min_samples_split controls the minimum number of samples required to split a node.
  • The best combination is selected based on cross-validation performance.

✦ Question: What happens if we don’t restrict the depth of a decision tree?

Answer: The tree will grow deep and might overfit the training data, performing poorly on unseen data.
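
To illustrate, a minimal sketch reusing the train/test split from Example 1: an unrestricted tree typically scores perfectly on the training data while dropping on the test data.

from sklearn.tree import DecisionTreeClassifier

# max_depth=None lets the tree grow until every leaf is pure
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=42)
deep_tree.fit(X_train, y_train)

print("Train accuracy:", deep_tree.score(X_train, y_train))  # usually 1.0
print("Test accuracy:", deep_tree.score(X_test, y_test))     # often lower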


🔹 RandomizedSearchCV vs GridSearchCV

GridSearchCV tries all combinations exhaustively. This is great for small search spaces but becomes slow with many parameters.

RandomizedSearchCV samples a fixed number of combinations at random (controlled by n_iter), making it much faster on large search spaces.
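
A minimal sketch of the difference in how the search space is specified (assuming scipy is installed): GridSearchCV needs explicit lists, while RandomizedSearchCV can also draw values from a distribution.

from scipy.stats import randint

# GridSearchCV: every value must be listed explicitly
grid_space = {'max_depth': [3, 5, 7, 9]}

# RandomizedSearchCV: values can be sampled from a distribution
random_space = {'max_depth': randint(3, 10)}  # draws integers from 3 to 9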

🔸 Use Case: RandomizedSearchCV for Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    random_state=42
)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)

Output:
Best parameters: {'n_estimators': 200, 'min_samples_split': 2, 'max_depth': 30}
Best score: 0.9619

Explanation

  • n_iter=5 means only 5 random combinations are tried.
  • Faster for large search spaces, at only a slight risk of missing the single best combination (the sketch below counts the trade-off for this grid).
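
To see the trade-off concretely, here is a small sketch using sklearn's ParameterGrid and ParameterSampler utilities to count how many combinations the full grid above contains versus how many RandomizedSearchCV actually tries with n_iter=5:

from sklearn.model_selection import ParameterGrid, ParameterSampler

# The grid above has 4 * 4 * 3 = 48 combinations in total
print(len(ParameterGrid(param_dist)))   # 48: what GridSearchCV would try

# RandomizedSearchCV with n_iter=5 samples just 5 of them
sampled = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
print(len(sampled))                     # 5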

✦ Question: Should we always use RandomizedSearch for big models?

Answer: Generally, yes. When there are many hyperparameters, randomized search is far more efficient and usually finds near-optimal settings much faster than an exhaustive grid search.


Final Tips for Hyperparameter Tuning

  • Start with Grid Search for small models like KNN or Decision Trees
  • Use Randomized Search for complex models like Random Forest, XGBoost, etc.
  • Always combine with cross-validation to avoid overfitting
  • After tuning, evaluate the best model once on the unseen test set (sketched below)
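
As a final step, here is a minimal sketch of that last tip, reusing grid_search and the train/test split from Example 1: the refitted best model is evaluated exactly once on the held-out test set.

# best_estimator_ is the model refitted on the full training set
# with the winning hyperparameters
best_model = grid_search.best_estimator_
print("Test accuracy:", best_model.score(X_test, y_test))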

Summary

Hyperparameter tuning is a critical step in improving your machine learning model’s performance. It helps you find settings that generalize well to new data.

Mastering tools like GridSearchCV and RandomizedSearchCV will make your ML workflow robust and production-ready.


