Email Spam Detection – Real-World Machine Learning Project
Email spam detection is one of the most common real-world applications of Machine Learning. The goal is to automatically classify emails as spam or not spam (ham).
We’ll walk through the entire pipeline: loading data, preprocessing, converting text to numerical features, training a model, and evaluating its performance.
Real-World Use Case
Spam emails often contain promotional messages, phishing links, or scams. Services like Gmail and Outlook use ML models to route these messages to your Spam folder automatically. We'll create a basic version of such a filter.
Dataset: SMS Spam Collection Dataset
This dataset contains 5,572 SMS messages labeled as "spam" or "ham".
Each row contains:
- label: 'ham' or 'spam'
- message: the actual text message
Sample Data
label | message
------|-----------------------------------------------------
ham   | I'm gonna be home soon and i don't want to talk...
spam  | WINNER!! You have won a $1000 Walmart gift card...
ham   | Ok lar... Joking wif u oni...
spam  | Six chances to win CASH. From 100 to 20,000...
✦ Why can't we just use keywords like 'WIN', 'FREE', 'CASH' to detect spam?
✧ Because spammers use tricks like: "Fr33", "C@sh", or "W!n" to fool such simple filters. A machine learning model can learn patterns beyond keywords using vectorized features.
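To see why, here is a minimal sketch of a keyword-only filter (the keyword list and messages are made up for illustration). The obfuscated version of the same message sails straight past it:

keywords = {"win", "free", "cash"}

def keyword_filter(message):
    # Flag the message as spam if any keyword appears as a whole word
    return any(word in keywords for word in message.lower().split())

print(keyword_filter("WIN a FREE prize now"))  # True  - caught by the keyword list
print(keyword_filter("W!n a Fr33 prize now"))  # False - obfuscation slips through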
Step-by-Step Solution
Step 1: Load and Explore Data
import pandas as pd
# Load the dataset (make sure the CSV is downloaded locally or use a URL)
df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/sms_spam.csv")
# Show basic info
print(df.head())
print(df['label'].value_counts())
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

ham     4825
spam     747
Name: label, dtype: int64
✦ Q: Why do we have to convert text to numbers before applying ML?
✧ ML algorithms work on numbers, not raw text. We must convert each message into a vector of numbers representing word frequencies or presence.
Step 2: Encode Labels (ham = 0, spam = 1)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['label'] = encoder.fit_transform(df['label']) # ham=0, spam=1
print(df.head())
   label                                            message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
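If you want to confirm which number each label received, the fitted encoder exposes the mapping (classes are assigned in alphabetical order, so 'ham' comes first):

# 'ham' sorts before 'spam', so it is encoded as 0
mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(mapping)  # ham maps to 0, spam maps to 1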
Step 3: Convert Text to Numerical Vectors
We'll use CountVectorizer to convert each message into a bag-of-words representation.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message']) # Features
y = df['label'] # Labels
✦ Q: What does a bag-of-words model do?
✧ It builds a vocabulary of all words in the dataset and counts how many times each word appears in a message. Each message becomes a vector of word counts.
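Here is a tiny, made-up corpus run through the same CountVectorizer so you can see the vocabulary and count vectors it produces (get_feature_names_out assumes a reasonably recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["win cash now", "see you at home now"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)

print(cv.get_feature_names_out())  # the learned vocabulary, sorted alphabetically
print(counts.toarray())            # one row of word counts per message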
Step 4: Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
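One optional tweak, not used in the split above: because spam is a minority class (747 of 5,572 messages), you can pass stratify=y so the train and test sets keep the same ham/spam ratio.

# Stratified variant of the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)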
Step 5: Train a Naive Bayes Model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict on test set
y_pred = model.predict(X_test)
Step 6: Evaluate the Model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.986

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       965
           1       0.98      0.94      0.96       150

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
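Accuracy alone hides which kind of mistakes the model makes. A quick confusion matrix (rows are actual labels, columns are predictions) shows how many hams were wrongly flagged as spam versus how much spam slipped through:

from sklearn.metrics import confusion_matrix

# Rows: actual (0 = ham, 1 = spam); columns: predicted
print(confusion_matrix(y_test, y_pred))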
✦ Q: Why is Naive Bayes a good choice here?
✧ It works well for text classification: it treats each word count as conditionally independent given the class (spam or ham), a crude but surprisingly effective assumption for bag-of-words features. It's also fast, works well with sparse data, and is easy to interpret for beginners.
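Because the model stores per-class log probabilities for every word, you can peek at which words it associates most strongly with spam. A rough sketch (again assuming a scikit-learn version with get_feature_names_out):

import numpy as np

words = vectorizer.get_feature_names_out()
# How much more likely each word is under the spam class than the ham class
log_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]
top10 = np.argsort(log_ratio)[-10:][::-1]
print(words[top10])  # the 10 most spam-leaning words in this dataset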
Step 7: Predict Your Own Message
sample = ["Congratulations! You've won a free iPhone! Click here to claim."]
sample_vector = vectorizer.transform(sample)  # reuse the fitted vocabulary; don't call fit again
print("Spam" if model.predict(sample_vector)[0] else "Not Spam")  # prediction 1 = spam, 0 = ham
Spam
Summary
- We used CountVectorizer to convert text into numeric vectors
- Trained a Multinomial Naive Bayes model
- Evaluated its accuracy on real-world SMS data
- Predicted spam on custom examples
What’s Next?
You can experiment with:
- TfidfVectorizer for weighted word importance
- Other models like Logistic Regression or SVM
- Cleaning text using regex (removing stopwords, punctuation, etc.)
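As a starting point for those experiments, here is a minimal sketch that swaps in TfidfVectorizer and Logistic Regression, wrapped in a scikit-learn Pipeline so the vectorizer is fit only on the raw training messages:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the raw text this time; the pipeline handles vectorization internally
msg_train, msg_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42
)

pipe = make_pipeline(
    TfidfVectorizer(stop_words='english'),   # TF-IDF weighting, drop common English words
    LogisticRegression(max_iter=1000)
)
pipe.fit(msg_train, y_train)
print("Accuracy:", pipe.score(msg_test, y_test))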