Email Spam Detection – Real-World Machine Learning Project
Email spam detection is one of the most common real-world applications of Machine Learning. The goal is to automatically classify emails as spam or not spam (ham).
We’ll walk through the entire pipeline: loading data, preprocessing, converting text to numerical features, training a model, and evaluating its performance.
Real-World Use Case
Spam emails often contain promotional messages, phishing links, or scams. Services like Gmail and Outlook use ML models to route these messages to your Spam folder automatically. We'll create a basic version of such a filter.
Dataset: SMS Spam Collection Dataset
This dataset contains 5,572 SMS messages labeled as "spam" or "ham".
Each row contains:
- label: 'ham' or 'spam'
- message: the actual text message
Sample Data
label | message
------|-----------------------------------------------------
ham   | I'm gonna be home soon and i don't want to talk...
spam  | WINNER!! You have won a $1000 Walmart gift card...
ham   | Ok lar... Joking wif u oni...
spam  | Six chances to win CASH. From 100 to 20,000...
✦ Why can't we just use keywords like 'WIN', 'FREE', 'CASH' to detect spam?
✧ Because spammers use tricks like: "Fr33", "C@sh", or "W!n" to fool such simple filters. A machine learning model can learn patterns beyond keywords using vectorized features.
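To see why, here is a minimal sketch of a keyword-only filter (the keyword list and messages are made up for illustration). The obfuscated version of the same message sails straight past it:

keywords = {"win", "free", "cash"}

def keyword_filter(message):
    # Flag the message as spam if any keyword appears as a whole word
    return any(word in keywords for word in message.lower().split())

print(keyword_filter("WIN a FREE prize now"))  # True  - caught by the keyword list
print(keyword_filter("W!n a Fr33 prize now"))  # False - obfuscation slips through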
Step-by-Step Solution
Step 1: Load and Explore Data
import pandas as pd
# Load the dataset (make sure the CSV is downloaded locally or use a URL)
df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/sms_spam.csv")
# Show basic info
print(df.head())
print(df['label'].value_counts())
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

ham     4825
spam     747
Name: label, dtype: int64
✦ Q: Why do we have to convert text to numbers before applying ML?
✧ ML algorithms work on numbers, not raw text. We must convert each message into a vector of numbers representing word frequencies or presence.
Step 2: Encode Labels (ham = 0, spam = 1)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['label'] = encoder.fit_transform(df['label']) # ham=0, spam=1
print(df.head())
   label                                            message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
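If you want to confirm which number each label received, the fitted encoder exposes the mapping (classes are assigned in alphabetical order, so 'ham' comes first):

# 'ham' sorts before 'spam', so it is encoded as 0
mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
print(mapping)  # ham maps to 0, spam maps to 1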
Step 3: Convert Text to Numerical Vectors
We'll use CountVectorizer to convert each message into a bag-of-words representation.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message']) # Features
y = df['label'] # Labels
✦ Q: What does a bag-of-words model do?
✧ It builds a vocabulary of all words in the dataset and counts how many times each word appears in a message. Each message becomes a vector of word counts.
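Here is a tiny, made-up corpus run through the same CountVectorizer so you can see the vocabulary and count vectors it produces (get_feature_names_out assumes a reasonably recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["win cash now", "see you at home now"]
cv = CountVectorizer()
counts = cv.fit_transform(toy)

print(cv.get_feature_names_out())  # the learned vocabulary, sorted alphabetically
print(counts.toarray())            # one row of word counts per message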
Step 4: Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
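One optional tweak, not used in the split above: because spam is a minority class (747 of 5,572 messages), you can pass stratify=y so the train and test sets keep the same ham/spam ratio.

# Stratified variant of the same split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)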
Step 5: Train a Naive Bayes Model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict on test set
y_pred = model.predict(X_test)
Step 6: Evaluate the Model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.986

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       965
           1       0.98      0.94      0.96       150

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
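Accuracy alone hides which kind of mistakes the model makes. A quick confusion matrix (rows are actual labels, columns are predictions) shows how many hams were wrongly flagged as spam versus how much spam slipped through:

from sklearn.metrics import confusion_matrix

# Rows: actual (0 = ham, 1 = spam); columns: predicted
print(confusion_matrix(y_test, y_pred))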
✦ Q: Why is Naive Bayes a good choice here?
✧ It works well for text classification: it treats each word count as conditionally independent given the class (spam or ham), a crude but surprisingly effective assumption for bag-of-words features. It's also fast, works well with sparse data, and is easy to interpret for beginners.
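Because the model stores per-class log probabilities for every word, you can peek at which words it associates most strongly with spam. A rough sketch (again assuming a scikit-learn version with get_feature_names_out):

import numpy as np

words = vectorizer.get_feature_names_out()
# How much more likely each word is under the spam class than the ham class
log_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]
top10 = np.argsort(log_ratio)[-10:][::-1]
print(words[top10])  # the 10 most spam-leaning words in this dataset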
Step 7: Predict Your Own Message
sample = ["Congratulations! You've won a free iPhone! Click here to claim."]
sample_vector = vectorizer.transform(sample)  # reuse the fitted vocabulary; don't call fit again
print("Spam" if model.predict(sample_vector)[0] else "Not Spam")  # prediction 1 = spam, 0 = ham
Spam
Summary
- We used CountVectorizer to convert text into numeric vectors
- Trained a Multinomial Naive Bayes model
- Evaluated its accuracy on real-world SMS data
- Predicted spam on custom examples
What’s Next?
You can experiment with:
- TfidfVectorizer for weighted word importance
- Other models like Logistic Regression or SVM
- Cleaning text using regex (removing stopwords, punctuation, etc.)
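As a starting point for those experiments, here is a minimal sketch that swaps in TfidfVectorizer and Logistic Regression, wrapped in a scikit-learn Pipeline so the vectorizer is fit only on the raw training messages:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Split the raw text this time; the pipeline handles vectorization internally
msg_train, msg_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42
)

pipe = make_pipeline(
    TfidfVectorizer(stop_words='english'),   # TF-IDF weighting, drop common English words
    LogisticRegression(max_iter=1000)
)
pipe.fit(msg_train, y_train)
print("Accuracy:", pipe.score(msg_test, y_test))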