Train-Test Split in Machine Learning (With Examples and Python Code)

What is Train-Test Split in Machine Learning?

In machine learning, the train-test split is a technique to evaluate how well your model will perform on unseen data. It simply means dividing your dataset into two parts:

Training Set: Used to train the model.
Testing Set: Used to test how well the model performs on data it has never seen before.

This helps you detect a common problem called overfitting, where a model performs well on the training data but poorly on new, unseen data.

Why is Train-Test Split Needed?

Let’s imagine you're studying for an exam. If you keep revising the same questions, you might become perfect at answering them. But will you perform well if the questions change slightly in the real test? Probably not.

Similarly, in ML, we use the test set to simulate unseen "exam questions" to check if our model truly understands the patterns or is just memorizing.

Question:

What might happen if you train and test your model on the same dataset?

Answer:

You may get a high accuracy, but it will be misleading. The model hasn't been tested on unseen data, so we can't trust it to generalize well.
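
To see how misleading a same-data score can be, here is a minimal sketch. It assumes scikit-learn and NumPy are installed and uses made-up random data whose labels have no relationship to the features, so there is nothing real to learn. The DecisionTreeClassifier and all variable names are illustrative choices, not part of the lesson's own example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data (assumption): random features and random 0/1 labels,
# so there is no real pattern for the model to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# An unrestricted decision tree can memorize its training rows.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)  # scored on rows it has seen
test_accuracy = model.score(X_test, y_test)     # scored on unseen rows

print("Accuracy on training data:", train_accuracy)
print("Accuracy on test data:", test_accuracy)
```

The training accuracy looks perfect because the tree has memorized every training row, while the test accuracy should land near chance level. Only the second number says anything about generalization.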

How Much Data Should You Use for Testing?

There's no single rule, but commonly:

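80% training and 20% testing (a widely used convention)
Sometimes 70-30 or 75-25 is used (75-25 is what scikit-learn's train_test_split uses when you set neither test_size nor train_size)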

You want enough data to train the model and enough left to test the performance reliably.
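
As a quick check of how the ratio translates into row counts, the sketch below splits 100 placeholder rows (an assumption made purely for illustration) once with an explicit test_size and once with scikit-learn's fallback, which reserves 25% for testing when neither test_size nor train_size is given.

```python
from sklearn.model_selection import train_test_split

# 100 placeholder rows (assumption: any sequence of samples works here).
rows = list(range(100))

# Explicit 80-20 split.
train_80, test_20 = train_test_split(rows, test_size=0.2, random_state=0)

# Neither test_size nor train_size given: 25% of the rows go to the test set.
train_default, test_default = train_test_split(rows, random_state=0)

print(len(train_80), len(test_20))            # 80 20
print(len(train_default), len(test_default))  # 75 25
```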

Visual Example

Let’s say we have 10 rows of data (D1 to D10):

Before split: [D1, D2, D3, D4, D5, D6, D7, D8, D9, D10]
After 80-20 split:
Training set → [D1, D2, D3, D4, D5, D6, D7, D8]
Testing set → [D9, D10]
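
The picture above keeps the rows in their original order. A minimal plain-Python sketch of that ordered split is shown here; the variable names are illustrative. Note that train_test_split shuffles the rows by default, so its output would usually not be in this order.

```python
rows = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "D10"]

test_fraction = 0.2
# Number of rows reserved for testing, rounded to the nearest whole row.
n_test = round(len(rows) * test_fraction)
split_at = len(rows) - n_test

train_rows = rows[:split_at]  # the first 8 rows
test_rows = rows[split_at:]   # the last 2 rows

print("Training set:", train_rows)
print("Testing set:", test_rows)
```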

Python Example: Using `train_test_split` from Scikit-Learn

from sklearn.model_selection import train_test_split
import pandas as pd

# Sample dataset
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [35, 40, 50, 55, 65, 70, 75, 80, 90, 95]
}

df = pd.DataFrame(data)

# Features (input) and Labels (output)
X = df[['Hours_Studied']]
y = df['Exam_Score']

# Split the dataset: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the contents of each split
print("X_train:")
print(X_train)
print("\nX_test:")
print(X_test)
print("\ny_train:")
print(y_train)
print("\ny_test:")
print(y_test)

Output:

X_train:
   Hours_Studied
5              6
0              1
7              8
2              3
9             10
4              5
3              4
6              7

X_test:
   Hours_Studied
8              9
1              2

y_train:
5    70
0    35
7    80
2    50
9    95
4    65
3    55
6    75
Name: Exam_Score, dtype: int64

y_test:
8    90
1    40
Name: Exam_Score, dtype: int64

Explanation of the Code

We created a simple dataset of Hours_Studied vs Exam_Score.
X is the input (feature), y is the output (label).
train_test_split() randomly splits the data into training and test sets.
test_size=0.2 means 20% of the data goes to the test set.
random_state=42 ensures the same split every time you run it (for reproducibility).

Question:

Why do we use random_state?

Answer:

Because train_test_split shuffles the data randomly before splitting, each run would otherwise produce a different split. Setting random_state fixes the shuffle, which gives consistent results for debugging and experiments.
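
The effect of random_state can be checked directly. This is a small sketch using a plain list of study hours (the variable names are illustrative): two calls with the same random_state return the same split, while a call without it may differ from run to run.

```python
from sklearn.model_selection import train_test_split

hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Same random_state twice: the shuffle, and therefore the split, is identical.
train_a, test_a = train_test_split(hours, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(hours, test_size=0.2, random_state=42)
print(train_a == train_b and test_a == test_b)  # True

# No random_state: the split may change from one run to the next.
train_c, test_c = train_test_split(hours, test_size=0.2)
print(test_c)
```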

Good Practices

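Shuffle data before splitting (train_test_split does this by default), unless the order matters, as with time series data.
Use stratified splits for classification tasks so the training and test sets keep the same class proportions as the full dataset (using stratify=y).
Never evaluate your model on its training data: the score will be overly optimistic and will hide overfitting.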
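
To illustrate the stratified split mentioned above, here is a minimal sketch on made-up, imbalanced labels (80 samples of class 0 and 20 of class 1, an assumption for illustration). With stratify=y, both resulting sets keep that 80:20 ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels (assumption): 80 samples of class 0, 20 of class 1.
y = [0] * 80 + [1] * 20
X = [[i] for i in range(len(y))]  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both sets keep the original 80:20 class ratio.
print("Train class counts:", Counter(y_train))
print("Test class counts:", Counter(y_test))
```

Without stratify, a small test set could by chance contain very few samples of the rare class, or none at all, which makes the test score unreliable.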

Next Steps

Now that you’ve learned how to split data into training and test sets, the next step is to train a model on the training set and evaluate it on the test set. The test score is your first estimate of how the model will perform in the real world!
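
As a preview of that workflow, here is a minimal sketch that reuses the Hours_Studied and Exam_Score data from this lesson. The choice of LinearRegression and mean_absolute_error is an assumption made for illustration; any scikit-learn estimator and metric would follow the same fit-then-evaluate pattern.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [35, 40, 50, 55, 65, 70, 75, 80, 90, 95],
})
X = df[['Hours_Studied']]
y = df['Exam_Score']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only.
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
predictions = model.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
print("Mean absolute error on test set:", test_mae)
```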
