Machine Learning for Beginners

Email Spam Detection using Machine Learning



Email Spam Detection – Real-World Machine Learning Project

Email spam detection is one of the most common real-world applications of Machine Learning. The goal is to automatically classify emails as spam or not spam (ham).

We’ll walk through the entire pipeline: loading data, preprocessing, converting text to numerical features, training a model, and evaluating its performance.

Real-World Use Case

Spam emails often contain promotional messages, phishing links, or scams. Services like Gmail and Outlook use ML models to automatically route such messages to your Spam folder. We'll build a basic version of such a filter.

Dataset: SMS Spam Collection Dataset

This dataset contains 5,572 SMS messages labeled as "spam" or "ham".

Each row contains two fields: a label ("ham" or "spam") and the raw text of the message.

Sample Data

label   | message
--------|-----------------------------------------------------
ham     | I'm gonna be home soon and i don't want to talk...
spam    | WINNER!! You have won a $1000 Walmart gift card...
ham     | Ok lar... Joking wif u oni...
spam    | Six chances to win CASH. From 100 to 20,000...

✦ Why can't we just use keywords like 'WIN', 'FREE', 'CASH' to detect spam?

✧ Because spammers use obfuscated spellings like "Fr33", "C@sh", or "W!n" to slip past such simple filters. A machine learning model can learn patterns beyond a fixed keyword list, because every token seen in the training data, including those obfuscated variants, becomes a feature it can weigh, as the sketch below shows.
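
To make that concrete, here is a tiny, hypothetical toy example (not part of the main pipeline): CountVectorizer, which we use properly in Step 3, lowercases the text and turns every token it sees into its own feature, so a spelling like "fr33" gets its own column that the model can learn a weight for.

from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: the obfuscated spelling "fr33" still becomes its own feature column
toy_messages = [
    "WINNER!! Claim your fr33 prize now",
    "See you at home tonight",
]

vec = CountVectorizer()
toy_counts = vec.fit_transform(toy_messages)

print(vec.get_feature_names_out())
# e.g. ['at' 'claim' 'fr33' 'home' 'now' 'prize' 'see' 'tonight' 'winner' 'you' 'your']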

Step-by-Step Solution

Step 1: Load and Explore Data


import pandas as pd

# Load the dataset (make sure the CSV is downloaded locally or use a URL)
df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/sms_spam.csv")

# Show basic info
print(df.head())
print(df['label'].value_counts())
   label                                            message
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...

ham     4825
spam     747
Name: label, dtype: int64

✦ Q: Why do we have to convert text to numbers before applying ML?

✧ ML algorithms work on numbers, not raw text. We must convert each message into a vector of numbers representing word frequencies or presence.

Step 2: Encode Labels (ham = 0, spam = 1)


from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df['label'] = encoder.fit_transform(df['label'])  # ham=0, spam=1

print(df.head())
   label                                            message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
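
If you want to double-check which class became which number, the fitted encoder exposes its classes in mapped order (an optional check, not part of the original pipeline):

# classes_ is sorted alphabetically, so index 0 = 'ham' and index 1 = 'spam'
print(list(encoder.classes_))   # ['ham', 'spam']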

Step 3: Convert Text to Numerical Vectors

We'll use CountVectorizer to convert each message into a bag-of-words representation.


from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message'])  # Features
y = df['label']                              # Labels

✦ Q: What does a bag-of-words model do?

✧ It builds a vocabulary of all words in the dataset and counts how many times each word appears in a message. Each message becomes a vector of word counts.
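
You can peek at the result on the real data (an optional inspection of the matrix built above):

# X is a sparse matrix: one row per message, one column per vocabulary word
print(X.shape)                         # (5572, <vocabulary size>)
print(len(vectorizer.vocabulary_))     # number of distinct words learned from the corpus

# Most entries are zero; only words that actually appear in a message get a count
print(X[0])                            # (row, column) -> count pairs for the first message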

Step 4: Train-Test Split


from sklearn.model_selection import train_test_split

# Hold out 20% of the messages for testing; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
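
A quick sanity check of the split sizes (optional; the counts follow from an 80/20 split of 5,572 messages):

print(X_train.shape[0], "training messages")
print(X_test.shape[0], "test messages")   # should match the total 'support' (1115) in Step 6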

Step 5: Train a Naive Bayes Model


from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

model = MultinomialNB()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

Step 6: Evaluate the Model


print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.986
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       965
           1       0.98      0.94      0.96       150

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
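
If you also want to see exactly how many spam messages slip through versus how many ham messages get flagged, a confusion matrix breaks the errors down (an optional extra, not in the original write-up):

from sklearn.metrics import confusion_matrix

# Rows = true class (0 = ham, 1 = spam), columns = predicted class
print(confusion_matrix(y_test, y_pred))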

✦ Q: Why is Naive Bayes a good choice here?

✧ It works well for text classification: Multinomial Naive Bayes models word counts and assumes each word occurs independently of the others given the class (spam or ham). That assumption is crude, but it works surprisingly well for bag-of-words features, and the model is fast to train and easy to interpret for beginners.
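
Because the model stores per-class word log-probabilities, you can even list the words it finds most indicative of spam. A small sketch, assuming the vectorizer and model fitted above:

import numpy as np

# feature_log_prob_[1] = log P(word | spam), feature_log_prob_[0] = log P(word | ham)
words = vectorizer.get_feature_names_out()
spam_vs_ham = model.feature_log_prob_[1] - model.feature_log_prob_[0]

# Top 10 words whose presence most increases the spam score
top = np.argsort(spam_vs_ham)[-10:][::-1]
print([words[i] for i in top])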

Step 7: Predict Your Own Message


sample = ["Congratulations! You've won a free iPhone! Click here to claim."]
sample_vector = vectorizer.transform(sample)
print("Spam" if model.predict(sample_vector)[0] else "Not Spam")
Spam
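
You can also ask the model how confident it is, since MultinomialNB exposes class probabilities (a small optional extension):

# Probability of [ham, spam] for the same sample message
proba = model.predict_proba(sample_vector)[0]
print(f"P(ham) = {proba[0]:.3f}, P(spam) = {proba[1]:.3f}")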

Summary

We loaded the SMS Spam Collection dataset (5,572 labeled messages), encoded the labels (ham = 0, spam = 1), converted each message into a bag-of-words vector with CountVectorizer, trained a Multinomial Naive Bayes classifier, and reached roughly 98.6% accuracy on a held-out 20% test set.

What’s Next?

You can experiment with:

- TfidfVectorizer instead of CountVectorizer, so common filler words get down-weighted
- Other classifiers, such as Logistic Regression or a linear SVM
- Extra text preprocessing, such as removing stop words or stemming
- Wrapping the vectorizer and model into a single scikit-learn Pipeline, as sketched below

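For instance, the first two ideas can be combined in a few lines with a scikit-learn Pipeline. This is a sketch under the same train/test setup (the names msg_train, y_train2, etc. are illustrative, and the resulting accuracy will differ slightly from the numbers above):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split the raw text this time, since the Pipeline handles vectorization itself
msg_train, msg_test, y_train2, y_test2 = train_test_split(
    df['message'], df['label'], test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),              # TF-IDF weighting instead of raw counts
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(msg_train, y_train2)
print("Accuracy:", accuracy_score(y_test2, pipe.predict(msg_test)))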