Email/SMS Spam-Ham Classification

A Data Science & Machine Learning-powered Flask Web Application that classifies Email/SMS messages as either Spam or Ham (Not Spam).

📌 Project Overview

This project demonstrates how Natural Language Processing (NLP) and Machine Learning can be used to classify text messages. It includes data cleaning, EDA, feature engineering, model building, and deployment in a real-time Web Application.

🧩 Problem Statement

With the ever-growing volume of digital communication, especially SMS and Email, distinguishing between legitimate messages and unwanted spam is critical. Spam messages often carry scams, phishing links, or irrelevant promotions that compromise user experience and safety.

💡 Why This Project Is Useful

This project provides an end-to-end solution for spam detection, from data preprocessing and feature engineering to deploying a real-time Flask Web Application. It helps automate message classification, enhancing communication security and saving users from spam clutter.

📦 Dataset & Preparation

Source: Kaggle SMS Spam Collection Dataset
Initial Shape: 5572 rows × 5 columns
Preprocessing:
- Dropped irrelevant columns: Unnamed: 2, Unnamed: 3, Unnamed: 4
- Renamed: v1 → label, v2 → message
- Removed 403 duplicate rows
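
The preparation steps above can be sketched with pandas. This is a minimal illustration on a tiny in-memory frame that mimics the raw column layout; the rows are made up, and the real file name and encoding are not stated in this README.

```python
import pandas as pd

# Hypothetical stand-in for the raw CSV: same column names, invented rows.
raw = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": [
        "Ok, see you at home",
        "WIN a free prize now",
        "Ok, see you at home",   # exact duplicate of the first row
        "Free offer, txt now",
    ],
    "Unnamed: 2": [None] * 4,
    "Unnamed: 3": [None] * 4,
    "Unnamed: 4": [None] * 4,
})

df = (
    raw.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])  # irrelevant columns
       .rename(columns={"v1": "label", "v2": "message"})          # readable names
       .drop_duplicates(keep="first")                             # remove repeated rows
       .reset_index(drop=True)
)

print(df.shape)
```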

📊 Exploratory Data Analysis (EDA)

Class Distribution:
- Ham: 87.4%
- Spam: 12.6%
Feature Creation:
- num_of_char: Total characters
- num_of_words: Total words (nltk.tokenize.word_tokenize)
- num_of_sentences: Sentence count
Visualizations:
- Pie chart of class distribution
- Bar charts & histograms for custom features
- Pairplot for feature relationships
- Word frequency bar plots for both spam and ham
Dropped these temporary features once the analysis was complete
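
As a rough sketch of the three count features: the project used nltk.tokenize.word_tokenize, whereas the version below swaps in regular expressions so it runs without downloading NLTK data. The function name `message_stats` and the regex rules are assumptions for illustration, and the counts may differ slightly from NLTK's.

```python
import re

def message_stats(message: str) -> dict:
    """Approximate the EDA features with regex-based tokenization."""
    # Words and standalone punctuation marks each count as one token,
    # which is roughly how word_tokenize treats them.
    words = re.findall(r"\w+|[^\w\s]", message)
    # Split on runs of sentence-ending punctuation and ignore empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    return {
        "num_of_char": len(message),
        "num_of_words": len(words),
        "num_of_sentences": len(sentences),
    }

stats = message_stats("Free entry! Txt WIN to 80086 now.")
print(stats)
```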

🧹 Text Preprocessing

Implemented a custom text transformation pipeline:

Lowercasing
Tokenization with nltk
Removal of stopwords, HTML tags, emojis, punctuation, and digits
Lemmatization and stemming (stemming with nltk's PorterStemmer)
Dropped 87 duplicates after text transformation
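
A hedged sketch of such a transformation function follows. It is not the contents of text_transformer.py: the function name `transform_text`, the tiny stopword set, and the regex tokenizer are assumptions chosen so the example runs without downloading NLTK corpora. Only PorterStemmer comes from NLTK here, and the lemmatization step is omitted.

```python
import re
from nltk.stem import PorterStemmer

# Small illustrative stopword set (assumption); a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "the", "to", "is", "you", "your", "and", "of", "in", "at", "for"}
stemmer = PorterStemmer()

def transform_text(text: str) -> str:
    text = text.lower()                        # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    # Keeping only alphabetic runs drops digits, punctuation, and emojis in one step.
    tokens = re.findall(r"[a-z]+", text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(stemmer.stem(t) for t in tokens)

cleaned = transform_text("<b>WINNING</b> a FREE prize!!! Call 0800 now 🎉")
print(cleaned)
```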

⚙️ Feature Engineering

Generated word clouds for:
- Spam (e.g., "free", "win", "txt", "offer")
- Ham (e.g., "love", "ok", "come", "home")
Vectorized text using CountVectorizer
Applied StratifiedShuffleSplit so that the train and test sets keep the same ham/spam ratio as the full dataset
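
The vectorize-then-split step might look like the sketch below. The toy corpus, the 50% test size, and the random seed are assumptions; the README does not state the split parameters the project used.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy corpus: 6 ham (0) and 4 spam (1) messages.
messages = [
    "ok see you at home", "love you come home", "are you coming home",
    "ok lets meet later", "call me when you come", "see you later love",
    "win free prize now", "free offer txt win", "claim free cash offer",
    "txt win to claim prize",
]
y = np.array([0] * 6 + [1] * 4)

# Bag-of-words counts: one row per message, one column per vocabulary word.
X = CountVectorizer().fit_transform(messages)

# A single stratified shuffle split keeps the 60/40 class ratio in both halves.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(y_train.mean(), y_test.mean())
```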

🤖 Model Building & Evaluation

Tested the following classifiers:

MultinomialNB
RandomForestClassifier
SVC
KNeighborsClassifier

Evaluation Metrics: Accuracy, Precision, Confusion Matrix
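
A minimal sketch of how the four classifiers could be trained and scored side by side. The toy corpus, default hyperparameters, and the `results` dictionary layout are assumptions for illustration, so the numbers it produces are unrelated to the results reported in this README.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical toy corpus (1 = spam, 0 = ham).
messages = [
    "win free prize now", "free offer txt win", "claim free cash offer",
    "txt win to claim prize", "free entry win cash", "urgent offer call now",
    "ok see you at home", "love you come home", "are you coming home",
    "ok lets meet later", "call me when you come", "see you later love",
]
labels = [1] * 6 + [0] * 6

X = CountVectorizer().fit_transform(messages)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=4, stratify=labels, random_state=42
)

# Default hyperparameters are an assumption; the project's settings are not listed.
models = {
    "MNB": MultinomialNB(),
    "RF": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "test_accuracy": accuracy_score(y_test, pred),
        # zero_division=0 avoids a warning if a model predicts no spam at all.
        "test_precision": precision_score(y_test, pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, pred, labels=[0, 1]),
    }

for name, scores in results.items():
    print(name, scores["test_accuracy"], scores["test_precision"])
```

Precision matters here because a false positive means a legitimate message is flagged as spam.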

📋 Results Summary:

| Algorithm | Training_Accuracy | Test_Accuracy | Training_Precision | Test_Precision |
|-----------|-------------------|---------------|---------------------|----------------|
| MNB       | 0.9909            | 0.9715        | 0.9676              | 0.8750         |
| RF        | 1.0000            | 0.9626        | 1.0000              | 1.0000         |
| SVC       | 0.9956            | 0.9607        | 1.0000              | 0.9885         |
| KNN       | 0.9262            | 0.9105        | 1.0000              | 1.0000         |

✅ Best Model: RandomForestClassifier (test precision 1.0000 with test accuracy 0.9626)
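
The README does not state the selection rule. One rule that is consistent with the table, shown here purely as an assumption, is to rank by test precision first and break ties with test accuracy:

```python
# Test-set figures copied from the results table above.
results = {
    "MNB": {"test_accuracy": 0.9715, "test_precision": 0.8750},
    "RF":  {"test_accuracy": 0.9626, "test_precision": 1.0000},
    "SVC": {"test_accuracy": 0.9607, "test_precision": 0.9885},
    "KNN": {"test_accuracy": 0.9105, "test_precision": 1.0000},
}

# Assumed rule: highest test precision wins, test accuracy breaks ties.
best = max(
    results,
    key=lambda name: (results[name]["test_precision"], results[name]["test_accuracy"]),
)
print(best)
```

Under this rule RF and KNN tie on precision, and RF wins on accuracy.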

🚀 Deployment

Saved final model and vectorizer using pickle
Deployed using Flask for real-time predictions
Users can input a message and receive immediate spam/ham classification
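
A hedged sketch of the serving side follows; it is not the project's app.py. The `/predict` route, the JSON request and response shapes, and the in-memory pickle round trip are assumptions, and MultinomialNB on four toy messages stands in for the final model so the example stays small.

```python
import pickle

from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in training data (1 = spam, 0 = ham).
texts = ["win free prize now", "free offer txt win",
         "ok see you at home", "love you come home"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Round-trip through pickle in memory; a real app would load these from files.
vectorizer = pickle.loads(pickle.dumps(vectorizer))
model = pickle.loads(pickle.dumps(model))

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"message": "some text"} (assumed shape).
    message = request.get_json(force=True).get("message", "")
    features = vectorizer.transform([message])
    label = "Spam" if model.predict(features)[0] == 1 else "Ham"
    return jsonify({"prediction": label})

# Exercise the route without starting a server.
client = app.test_client()
response = client.post("/predict", json={"message": "win a free prize"})
print(response.get_json())
```

In a real deployment the message would also pass through the same text transformation used at training time before vectorization.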

🛠️ Technologies Used

Programming: Python
Libraries: Pandas, NumPy, NLTK, Scikit-learn
Visualization: Matplotlib, Seaborn, WordCloud
Web Framework: Flask

sagarprajapat2004/Spam_Ham_Classification