This repository contains a machine learning project designed to classify emails as either spam or ham (not spam). The project demonstrates a complete end-to-end pipeline for text classification, from data loading and preprocessing to model training, evaluation, and deployment.
- Data Loading & Preparation: Efficiently loads email data from various categories.
- Robust Text Preprocessing: Includes comprehensive steps like header removal, HTML tag cleaning, URL/number generalization, punctuation removal, and handling of repeated characters.
- Model Comparison: Evaluates multiple classification algorithms (Naive Bayes, Logistic Regression, Linear SVM, Random Forest) to identify the best performer.
- Hyperparameter Tuning: Utilizes GridSearchCV to optimize model parameters for improved performance.
- Model Persistence: Saves the trained model and vectorizer for easy reuse and deployment.
- Prediction Function: Provides a straightforward function to classify new emails.
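The preprocessing steps listed above could look roughly like the sketch below. It is an illustration built only on Python's `re` module, not the notebook's actual code: the function names `strip_headers` and `preprocess_email`, the `URL` and `NUMBER` placeholder tokens, and the order of the steps are all assumptions.

```python
import re


def strip_headers(raw):
    """Drop everything before the first blank line (assumed header/body separator)."""
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    return parts[1] if len(parts) == 2 else raw


def preprocess_email(raw):
    """Hypothetical cleaning pipeline mirroring the steps named in the feature list."""
    text = strip_headers(raw).lower()
    text = re.sub(r"<[^>]+>", " ", text)                    # remove HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " URL ", text)   # generalize URLs
    text = re.sub(r"\d+(\.\d+)?", " NUMBER ", text)         # generalize numbers
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)              # collapse runs of 3+ identical chars to 2
    text = re.sub(r"[^\w\s]", " ", text)                    # remove punctuation
    return re.sub(r"\s+", " ", text).strip()                # normalize whitespace


sample = (
    "From: a@b.com\nSubject: Hi\n\n"
    "<p>WIN $1000 NOW!!! Visit http://spam.example.com/offer sooooo good</p>"
)
print(preprocess_email(sample))
```

Splitting on the first blank line is a simplification; real messages with multipart bodies may be better handled with the standard library's `email` package.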
To run this project, you will need Python 3.x and the following libraries:
- pandas
- numpy
- scikit-learn
- joblib
- re (built-in Python module)
You can install the third-party libraries using pip:
pip install pandas numpy scikit-learn joblib
This project expects email data organized into easy_ham, hard_ham, and spam subdirectories within a base directory. Due to the size and nature of email datasets, the data itself is not included in this repository. You will need to provide your own dataset.
To prepare your data:
- Organize your email files into three folders: easy_ham, hard_ham, and spam.
- Place these three folders inside a parent directory (e.g., Spam Filter).
- Update the base_directory variable in the Spam_Filter.ipynb notebook to point to the absolute path of your parent directory.
Example Data Structure:
your_project_root/
├── Spam_Filter.ipynb
├── models/
│   ├── svm_best_model.pkl
│   └── vectorizer.pkl
├── your_data_directory/    <-- This is your base_directory
│   ├── easy_ham/
│   │   ├── email1.txt
│   │   └── email2.txt
│   ├── hard_ham/
│   │   └── email3.txt
│   └── spam/
│       ├── spam_email1.txt
│       └── spam_email2.txt
└── README.md
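Given a layout like the one above, loading could be sketched as follows. This is an assumption-laden illustration, not the notebook's loader: the `load_emails` helper, the `CATEGORY_LABELS` mapping, and the 0 = ham / 1 = spam encoding are hypothetical.

```python
from pathlib import Path

# Assumed label encoding: both ham folders map to 0, spam maps to 1.
CATEGORY_LABELS = {"easy_ham": 0, "hard_ham": 0, "spam": 1}


def load_emails(base_directory):
    """Read every file under the three category folders and return (texts, labels)."""
    texts, labels = [], []
    for category, label in CATEGORY_LABELS.items():
        for path in sorted(Path(base_directory, category).iterdir()):
            if path.is_file():
                # latin-1 can decode any byte sequence, so odd encodings do not raise.
                texts.append(path.read_text(encoding="latin-1"))
                labels.append(label)
    return texts, labels


# Example call (the path is a placeholder for your own base_directory):
# texts, labels = load_emails("/absolute/path/to/your_data_directory")
```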
- Clone this repository:
  git clone https://github.com/salahuddin212/Spam-Classifier.git
  cd Spam-Classifier
- Ensure you have Jupyter Notebook installed (pip install notebook).
- Launch Jupyter Notebook from the project root directory:
  jupyter notebook
- Open Spam_Filter.ipynb and run all cells. The notebook will guide you through:
  - Loading the dataset.
  - Splitting data into training and testing sets.
  - Applying text preprocessing.
  - Vectorizing text data.
  - Training and comparing different machine learning models.
  - Tuning hyperparameters for the best model.
  - Evaluating the final model.
  - Saving the trained model and vectorizer.
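For orientation, the notebook stages above can be approximated with scikit-learn as in the sketch below. It runs on a tiny made-up corpus rather than real emails, and the label encoding (1 = spam, 0 = ham), the `C` grid, the fold count, and the variable names are assumptions chosen for illustration; the notebook's actual settings may differ.

```python
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in corpus, for illustration only. Assumed encoding: 1 = spam, 0 = ham.
spam = [
    "win free money now", "claim your free prize today", "cheap loans apply now",
    "free gift card winner", "urgent offer click now", "earn cash fast online",
    "limited offer free trial", "winner claim cash prize",
]
ham = [
    "meeting moved to monday", "please review the attached report",
    "lunch at noon tomorrow", "notes from the team call",
    "project schedule for next week", "thanks for the code review",
    "see you at the meeting", "draft report for the team",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# Split first, then fit the vectorizer on the training texts only.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Compare the four model families with 3-fold cross-validation on the training set.
candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=3).mean()
    for name, model in candidates.items()
}

# Tune the linear SVM's regularization strength C (grid values are arbitrary).
search = GridSearchCV(LinearSVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

# Persist the tuned model and the fitted vectorizer under the README's file names.
joblib.dump(search.best_estimator_, "svm_best_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")
```

Fitting the vectorizer after the split keeps test-set vocabulary out of training, which is why the split comes first in the list above.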
Spam-Classifier/
├── Spam_Filter.ipynb # Main Jupyter Notebook with the ML pipeline
├── svm_best_model.pkl # Trained Linear SVM model (saved after execution)
├── vectorizer.pkl # Fitted CountVectorizer (saved after execution)
└── README.md # Project README file
- Dataset Imbalance: Implement strategies to handle dataset imbalance (e.g., class_weight, oversampling/undersampling with imbalanced-learn).
- Advanced Feature Engineering: Experiment with TfidfVectorizer, n-grams, or word embeddings for richer text representation.
- Visualizations: Add data exploration and model performance visualizations (e.g., confusion matrix, ROC curves, word clouds).
- Robust Model Comparison: Consider nested cross-validation for more rigorous model selection.
- Modular Code: For larger projects, refactor the notebook into modular Python scripts (e.g., data_loader.py, preprocessing.py, model_training.py).
- Model Loading Optimization: For production use, optimize the predict_email function to load the model and vectorizer only once.
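One way the load-once idea for predict_email might be approached is to cache the loaded artifacts, as sketched below. The `load_artifacts` helper, the file locations, the returned "spam"/"ham" strings, and the assumption that the model predicts 1 for spam are all hypothetical; the notebook's predict_email may have a different signature.

```python
from functools import lru_cache
from pathlib import Path

import joblib

# Assumed locations; adjust to wherever the notebook saved the artifacts.
MODEL_PATH = Path("svm_best_model.pkl")
VECTORIZER_PATH = Path("vectorizer.pkl")


@lru_cache(maxsize=1)
def load_artifacts():
    """Load the model and vectorizer from disk on the first call, then reuse them."""
    model = joblib.load(MODEL_PATH)
    vectorizer = joblib.load(VECTORIZER_PATH)
    return model, vectorizer


def predict_email(text):
    """Classify one email; assumes the model encodes spam as 1."""
    model, vectorizer = load_artifacts()
    features = vectorizer.transform([text])
    return "spam" if model.predict(features)[0] == 1 else "ham"
```

Any preprocessing applied during training would also need to be applied to `text` before vectorizing; it is left out here to keep the caching pattern in focus.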
Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.
This project is licensed under the MIT License.
For any questions or feedback, please contact: https://www.linkedin.com/in/salahuddin-bayassi/