Train-Test Split in Machine Learning (With Examples and Python Code)



What is Train-Test Split in Machine Learning?

In machine learning, the train-test split is a technique to evaluate how well your model will perform on unseen data. It simply means dividing your dataset into two parts:

  • Training Set: Used to train the model.
  • Testing Set: Used to test how well the model performs on data it has never seen before.

This helps you detect a common problem called overfitting, where a model performs well on the training data but poorly on new, unseen data.

Why is Train-Test Split Needed?

Let’s imagine you're studying for an exam. If you keep revising the same questions, you might become perfect at answering them. But will you perform well if the questions change slightly in the real test? Probably not.

Similarly, in ML, we use the test set to simulate unseen "exam questions" to check if our model truly understands the patterns or is just memorizing.

Question:

What might happen if you train and test your model on the same dataset?

Answer:

You may get a high accuracy, but it will be misleading. The model hasn't been tested on unseen data, so we can't trust it to generalize well.
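To see this effect in practice, here is a minimal sketch. It assumes a synthetic dataset with deliberately noisy labels and an unconstrained DecisionTreeClassifier. Neither comes from this tutorial; they are chosen only because such a tree can memorize its training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 300 rows, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

# The true rule is "first feature > 0", but about 20% of labels are flipped (noise).
y = (X[:, 0] > 0).astype(int)
flip = rng.random(300) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree can memorize the training rows, noise included.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on rows the model has seen
test_acc = model.score(X_test, y_test)     # scored on held-out rows

print("Accuracy on training data:", train_acc)
print("Accuracy on test data:", test_acc)
```

The score on the training rows looks perfect because the tree has memorized them, noise and all. The score on the held-out rows is the more honest estimate.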

How Much Data Should You Use for Testing?

There's no single rule, but commonly:

  • 80% training and 20% testing (a very common choice; if you give no size at all, scikit-learn's train_test_split uses 75% training and 25% testing)
  • Sometimes 70-30 or 75-25 is used

You want enough data to train the model and enough left to test the performance reliably.
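The ratios above translate into row counts as follows. This is a small sketch that assumes a placeholder dataset of 100 rows (the rows list is hypothetical) and passes each ratio through the test_size parameter.

```python
from sklearn.model_selection import train_test_split

rows = list(range(100))  # 100 placeholder rows (hypothetical data)

sizes = {}
for test_size in (0.2, 0.25, 0.3):
    train, test = train_test_split(rows, test_size=test_size, random_state=0)
    sizes[test_size] = (len(train), len(test))
    print(f"test_size={test_size}: {len(train)} training rows, {len(test)} test rows")
```

On a small dataset the test set can shrink to only a few rows, which makes the measured performance noisy. This is why the ratio is a judgment call rather than a fixed rule.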

Visual Example

Let’s say we have 10 rows of data (D1 to D10):

Before split: [D1, D2, D3, D4, D5, D6, D7, D8, D9, D10]
After an 80-20 split (shown here without shuffling, for simplicity):
Training set → [D1, D2, D3, D4, D5, D6, D7, D8]
Testing set → [D9, D10]
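The picture above can be reproduced with plain list slicing. This sketch uses no library and does not shuffle, so it is only an illustration of the idea, not how train_test_split behaves by default.

```python
# The ten rows from the picture: D1 ... D10
rows = [f"D{i}" for i in range(1, 11)]

# First 80% of the rows go to training (integer arithmetic avoids rounding issues).
split_index = len(rows) * 80 // 100

train_rows = rows[:split_index]  # first 8 rows
test_rows = rows[split_index:]   # last 2 rows

print("Training set:", train_rows)
print("Testing set:", test_rows)
```

Passing shuffle=False to train_test_split gives the same ordered split. By default the function shuffles the rows first, so the two sets contain a random mix of rows.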

Python Example: Using train_test_split from Scikit-Learn

from sklearn.model_selection import train_test_split
import pandas as pd

# Sample dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [35, 40, 50, 55, 65, 70, 75, 80, 90, 95]
}

df = pd.DataFrame(data)

# Features (input) and Labels (output)
X = df[['Hours_Studied']]
y = df['Exam_Score']

# Split the dataset: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the contents of each split
print("X_train:")
print(X_train)
print("\nX_test:")
print(X_test)
print("\ny_train:")
print(y_train)
print("\ny_test:")
print(y_test)

Output:

X_train:
   Hours_Studied
5              6
0              1
7              8
2              3
9             10
4              5
3              4
6              7

X_test:
   Hours_Studied
8              9
1              2

y_train:
5    70
0    35
7    80
2    50
9    95
4    65
3    55
6    75
Name: Exam_Score, dtype: int64

y_test:
8    90
1    40
Name: Exam_Score, dtype: int64

Explanation of the Code

  • We created a simple dataset of Hours_Studied vs Exam_Score.
  • X is the input (feature), y is the output (label).
  • train_test_split() randomly splits the data into training and test sets.
  • test_size=0.2 means 20% of the data goes to the test set.
  • random_state=42 ensures the same split every time you run it (for reproducibility).

Question:

Why do we use random_state?

Answer:

Because the data is shuffled randomly before splitting, each run could otherwise produce a different split. Setting random_state fixes the shuffle, so results stay consistent across runs for debugging and experiments.
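A quick way to check this yourself is to split the same data twice with the same seed and compare. The sketch below uses a placeholder list of ten row numbers; the second seed (7) is an arbitrary choice for illustration.

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))  # placeholder row numbers 0..9

# Same random_state twice: the shuffle, and so the split, is identical.
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)
same_split = (train_a == train_b) and (test_a == test_b)
print("Same split with the same seed:", same_split)

# A different seed draws a different shuffle, so the split will usually differ.
train_c, test_c = train_test_split(rows, test_size=0.2, random_state=7)
print("Test rows with random_state=42:", test_a)
print("Test rows with random_state=7: ", test_c)
```

The value 42 has no special meaning. Any fixed integer gives a repeatable split.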

Good Practices

  • Shuffle data before splitting (done by default), unless the order of rows matters. For time series, shuffling would let the model train on future data.
  • Use stratified splits for classification tasks to maintain class balance (using stratify=y).
  • Never test your model on training data. It hides overfitting and gives a misleadingly high score.
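The stratified split mentioned above can be sketched as follows. The labels here are hypothetical (80 rows of class 0 and 20 rows of class 1), and the features are placeholders.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class 0 and 20 rows of class 1.
y = [0] * 80 + [1] * 20
X = [[i] for i in range(100)]  # placeholder features, one value per row

# stratify=y keeps the 80/20 class ratio in both the training and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("Training class counts:", dict(train_counts))
print("Test class counts:", dict(test_counts))
```

Without stratify, a random split of a small or imbalanced dataset can leave the test set with very few rows of the rare class, or none at all.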

Next Steps

Now that you’ve learned how to split data into training and test sets, the next step is to train a model on the training set and evaluate it on the test set. That’s how you know your model is ready for the real world!
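As a preview of that step, here is a minimal sketch that reuses the Hours_Studied and Exam_Score data from the example above. The choice of LinearRegression and of mean absolute error as the metric are assumptions made for illustration; other models and metrics may suit your problem better.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same sample dataset as in the earlier example
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [35, 40, 50, 55, 65, 70, 75, 80, 90, 95]
})
X = df[['Hours_Studied']]
y = df['Exam_Score']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train only on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate only on the test set, which the model has never seen
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print("Mean absolute error on the test set:", mae)
```

With only two test rows, this number is a rough indication at best. Real projects need a much larger test set for a reliable estimate.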


