What is Train-Test Split in Machine Learning?
In machine learning, the train-test split is a technique to evaluate how well your model will perform on unseen data. It simply means dividing your dataset into two parts:
- Training Set: Used to train the model.
- Testing Set: Used to test how well the model performs on data it has never seen before.
This helps you avoid a common mistake called overfitting—where a model performs well on the training data but poorly on new, unseen data.
Why is Train-Test Split Needed?
Let’s imagine you're studying for an exam. If you keep practicing the same questions, you might become perfect at answering them, but will you perform well if the questions change slightly in the real test? Probably not.
Similarly, in ML, we use the test set to simulate unseen "exam questions" to check if our model truly understands the patterns or is just memorizing.
🧠 Question:
What might happen if you train and test your model on the same dataset?
Answer:
You may get a high accuracy, but it will be misleading. The model hasn't been tested on unseen data, so we can't trust it to generalize well.
How Much Data Should You Use for Testing?
There's no single rule, but commonly:
- 80% training and 20% testing is a common choice
- 70-30 or 75-25 splits are also used (scikit-learn actually defaults to 75-25 when test_size is not specified)
You want enough data to train the model and enough left to test the performance reliably.
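To get a feel for how the ratio affects the split, here is a small sketch (the placeholder rows and loop are mine, not part of the example above) that prints the resulting set sizes for a few common choices of test_size:

```python
from sklearn.model_selection import train_test_split

# 10 placeholder rows, just to count how the split divides them
rows = list(range(10))

# Try a few common split ratios and show the resulting sizes
for test_size in (0.2, 0.25, 0.3):
    train, test = train_test_split(rows, test_size=test_size, random_state=0)
    print(f"test_size={test_size}: {len(train)} train rows, {len(test)} test rows")
```

Note that scikit-learn rounds the test set size up, so test_size=0.25 on 10 rows yields 3 test rows, not 2.5.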
Visual Example
Let’s say we have 10 rows of data (D1 to D10):
Before split: [D1, D2, D3, D4, D5, D6, D7, D8, D9, D10]
After 80-20 split:
Training set → [D1, D2, D3, D4, D5, D6, D7, D8]
Testing set → [D9, D10]
Python Example: Using train_test_split from Scikit-Learn
from sklearn.model_selection import train_test_split
import pandas as pd
# Sample dataset
data = {
'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Exam_Score': [35, 40, 50, 55, 65, 70, 75, 80, 90, 95]
}
df = pd.DataFrame(data)
# Features (input) and Labels (output)
X = df[['Hours_Studied']]
y = df['Exam_Score']
# Split the dataset: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Print the resulting splits
print("X_train:")
print(X_train)
print("\nX_test:")
print(X_test)
print("\ny_train:")
print(y_train)
print("\ny_test:")
print(y_test)
Output:
X_train:
Hours_Studied
5 6
0 1
7 8
2 3
9 10
4 5
3 4
6 7
X_test:
Hours_Studied
8 9
1 2
y_train:
5 70
0 35
7 80
2 50
9 95
4 65
3 55
6 75
Name: Exam_Score, dtype: int64
y_test:
8 90
1 40
Name: Exam_Score, dtype: int64
Explanation of the Code
- We created a simple dataset of Hours_Studied vs Exam_Score.
- X is the input (feature), y is the output (label).
- train_test_split() randomly splits the data into training and test sets.
- test_size=0.2 means 20% of the data goes to the test set.
- random_state=42 ensures the same split every time you run it (for reproducibility).
🧠 Question:
Why do we use random_state?
Answer:
Because the data is shuffled randomly before splitting. Setting random_state ensures you get the same split on every run, which keeps results consistent for debugging and experiments.
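This reproducibility is easy to verify directly. The following sketch (the variable names are my own) splits the same data twice with the same random_state and shows the results match:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# Same random_state -> identical splits on every call
a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
print(a_test == b_test)  # True: the split is reproducible

# Omitting random_state -> the split may differ from run to run
c_train, c_test = train_test_split(data, test_size=0.2)
```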
Good Practices
- Always shuffle data before splitting (done by default).
- Use stratified splits for classification tasks to maintain class balance (using stratify=y).
- Never evaluate your model on the data it was trained on: the score will be overly optimistic and won't tell you how the model performs on unseen data.
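To illustrate the stratify option, here is a minimal sketch with made-up labels (6 of class 'A', 4 of class 'B'); with stratify=y, both splits preserve the original 60/40 class ratio:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical classification data: 6 samples of class 'A', 4 of class 'B'
X = [[i] for i in range(10)]
y = ['A'] * 6 + ['B'] * 4

# stratify=y keeps the 60/40 class ratio in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
print(Counter(y_train))  # 3 'A' and 2 'B'
print(Counter(y_test))   # 3 'A' and 2 'B'
```

Without stratify, a small test set can end up with a skewed or even missing class purely by chance.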
Next Steps
Now that you’ve learned how to split data into training and test sets, the next step is to train a model on the training set and evaluate it on the test set. That’s how you know your model is ready for the real world!
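As a preview, here is one way that next step might look using the same dataset from above (LinearRegression is just one model choice for illustration, not the only option):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same dataset as in the example above
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Exam_Score': [35, 40, 50, 55, 65, 70, 75, 80, 90, 95],
})
X = df[['Hours_Studied']]
y = df['Exam_Score']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set (score() returns R^2 for regressors)
print("Test R^2:", model.score(X_test, y_test))
```

Because the model never saw the test rows during fitting, this score is an honest estimate of how it generalizes.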