You can support this website with a contribution of your choice.
When making a contribution, mention your name, and programguru.org in the message. Your name shall be displayed in the sponsors list.
Email spam detection is one of the most common real-world applications of Machine Learning. The goal is to automatically classify emails as spam or not spam (ham).
We’ll walk through the entire pipeline: loading data, preprocessing, converting text to numerical features, training a model, and evaluating its performance.
Spam emails often contain promotional messages, phishing links, or scams. Gmail and Outlook use ML models to move these automatically into your Spam folder. We'll create a basic version of such a filter.
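The dataset used here contains 5,572 SMS messages, each labeled as "spam" or "ham". These are text messages rather than emails, but the same pipeline applies to email text.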
Each row contains:
label: 'ham' or 'spam'
message: The actual text message

label   | message
--------|-----------------------------------------------------
ham     | I'm gonna be home soon and i don't want to talk...
spam    | WINNER!! You have won a $1000 Walmart gift card...
ham     | Ok lar... Joking wif u oni...
spam    | Six chances to win CASH. From 100 to 20,000...
✧ A simple keyword filter is not enough, because spammers use tricks like "Fr33", "C@sh", or "W!n" to fool it. A machine learning model can learn patterns beyond exact keywords from vectorized features.
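To see why exact keyword matching is fragile, here is a minimal sketch of a keyword filter. The keyword list and the `keyword_filter` helper are hypothetical and exist only for illustration; they are not part of the pipeline built in this tutorial.

```python
# Hypothetical keyword list, chosen only for illustration.
SPAM_KEYWORDS = {"free", "cash", "win"}

def keyword_filter(message):
    """Return True if any word in the message exactly matches a spam keyword."""
    words = message.lower().split()
    # Strip leading/trailing punctuation so "free!" still matches "free".
    cleaned = {w.strip(".,!?") for w in words}
    return not SPAM_KEYWORDS.isdisjoint(cleaned)

plain = "Win free cash now"
obfuscated = "W!n Fr33 C@sh now"

print(keyword_filter(plain))       # True
print(keyword_filter(obfuscated))  # False: the obfuscated spellings slip through
```

The second message carries the same intent as the first, yet none of its tokens match the keyword list exactly.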
import pandas as pd
# Load the dataset (make sure the CSV is downloaded locally or use a URL)
df = pd.read_csv("https://raw.githubusercontent.com/amankharwal/Website-data/master/sms_spam.csv")
# Show basic info
print(df.head())
print(df['label'].value_counts())
  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
ham     4825
spam     747
Name: label, dtype: int64
✧ ML algorithms work on numbers, not raw text. We must convert each message into a vector of numbers representing word frequencies or presence.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['label'] = encoder.fit_transform(df['label']) # ham=0, spam=1
print(df.head())
   label                                            message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
We'll use CountVectorizer to convert each message into a bag-of-words representation.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['message']) # Features
y = df['label'] # Labels
✧ CountVectorizer builds a vocabulary of all words in the dataset and counts how many times each word appears in each message. Each message becomes a vector of word counts.
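The idea of a vocabulary plus per-message counts can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the `tokenize` and `to_count_vector` helpers and the three toy messages are assumptions made for the example, and CountVectorizer's default tokenizer differs (it also drops punctuation and one-character tokens).

```python
from collections import Counter

def tokenize(text):
    # Simplified tokenizer: lowercase, then split on whitespace.
    return text.lower().split()

# Hypothetical toy messages, not taken from the SMS dataset.
messages = ["free cash free", "call me later", "free call"]

# Vocabulary: every distinct word, sorted, so each word gets a fixed column.
vocabulary = sorted({word for msg in messages for word in tokenize(msg)})

def to_count_vector(text):
    counts = Counter(tokenize(text))
    # One entry per vocabulary word; words absent from the message count as 0.
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(m) for m in messages]

print(vocabulary)  # ['call', 'cash', 'free', 'later', 'me']
for msg, vec in zip(messages, vectors):
    print(vec, msg)
# [0, 1, 2, 0, 0] free cash free
# [1, 0, 0, 1, 1] call me later
# [1, 0, 1, 0, 0] free call
```

Most entries are zero because each message uses only a few of the vocabulary words, which is why scikit-learn stores the result as a sparse matrix.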
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
model = MultinomialNB()
model.fit(X_train, y_train)
# Predict on test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Accuracy: 0.986
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       965
           1       0.98      0.94      0.96       150

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
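Accuracy alone can flatter a spam filter, because ham messages far outnumber spam in this dataset. The sketch below first computes the accuracy of a filter that always answers "ham", using the class counts printed earlier, and then shows how precision, recall, and F1 are derived for the spam class. The `y_true` and `y_pred` lists are hypothetical toy labels, not results from the model above.

```python
# Majority-class baseline from the class counts printed earlier (4825 ham, 747 spam).
baseline_accuracy = 4825 / (4825 + 747)
print(round(baseline_accuracy, 3))  # 0.866

# Hypothetical toy labels (1 = spam, 0 = ham), not taken from the real test set.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham wrongly flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through

precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)     # of all real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn)             # 3 1 1
print(precision, recall, f1)  # 0.75 0.75 0.75
```

For a spam filter, low precision means legitimate mail lands in the Spam folder, while low recall means spam reaches the inbox, so both rows of the report are worth reading.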
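✧ Multinomial Naive Bayes works well for text classification because it models word counts directly. It makes the simplifying assumption that words occur independently of each other given the class, which keeps it fast to train and easy for beginners to interpret.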
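The independence assumption means a message's score for a class is simply the class prior multiplied by one probability per word. Below is a minimal from-scratch sketch of that calculation with add-one (Laplace) smoothing. The toy training messages and the `log_score` and `predict` helpers are hypothetical; this illustrates the idea rather than scikit-learn's actual implementation.

```python
import math
from collections import Counter

# Hypothetical toy training data, not taken from the SMS dataset.
train = [
    ("win free cash now", "spam"),
    ("free prize win", "spam"),
    ("see you at home", "ham"),
    ("call you later at home", "ham"),
]

# Count messages per class and word occurrences per class.
class_docs = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_docs}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def log_score(text, label, alpha=1.0):
    """log P(label) + sum over words of log P(word | label), smoothed by alpha."""
    score = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        score += math.log((word_counts[label][word] + alpha)
                          / (total + alpha * len(vocab)))
    return score

def predict(text):
    # Pick the class with the highest log score.
    return max(class_docs, key=lambda label: log_score(text, label))

print(predict("free cash"))      # spam
print(predict("see you later"))  # ham
```

Working with log probabilities avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word unseen in one class from forcing that class's probability to zero.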
sample = ["Congratulations! You've won a free iPhone! Click here to claim."]
sample_vector = vectorizer.transform(sample)
print("Spam" if model.predict(sample_vector)[0] else "Not Spam")
Spam
CountVectorizer to convert text into numeric vectors
Multinomial Naive Bayes model
You can experiment with:
TfidfVectorizer for weighted word importance
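To give a feel for what "weighted word importance" means, here is a small pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches scikit-learn's documented defaults; the toy messages and the `tfidf_vector` helper are hypothetical, and real TfidfVectorizer output may differ because of its tokenizer.

```python
import math
from collections import Counter

# Hypothetical toy messages, not taken from the SMS dataset.
docs = ["free cash free", "call me later", "free call"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many messages does each word appear?
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    # L2-normalize so long and short messages are comparable.
    return [x / norm if norm else 0.0 for x in raw]

vec = tfidf_vector(tokenized[0])
print(idf["cash"] > idf["free"])  # True: "cash" appears in fewer messages
print([round(x, 3) for x in vec])
```

In the tutorial's pipeline, trying this out should only require replacing CountVectorizer with TfidfVectorizer from sklearn.feature_extraction.text and refitting the model.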