📧 Email Spam Classifier

This repository contains a machine learning project designed to classify emails as either spam or ham (not spam). The project demonstrates a complete end-to-end pipeline for text classification, from data loading and preprocessing to model training, evaluation, and deployment.

✨ Features

Data Loading & Preparation: Loads email data from the easy_ham, hard_ham, and spam category folders.
Robust Text Preprocessing: Includes comprehensive steps like header removal, HTML tag cleaning, URL/number generalization, punctuation removal, and handling of repeated characters.
Model Comparison: Evaluates multiple classification algorithms (Naive Bayes, Logistic Regression, Linear SVM, Random Forest) to identify the best performer.
Hyperparameter Tuning: Utilizes GridSearchCV to optimize model parameters for improved performance.
Model Persistence: Saves the trained model and vectorizer for easy reuse and deployment.
Prediction Function: Provides a straightforward function to classify new emails.
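
The preprocessing steps listed above can be sketched with the built-in re module. This is a minimal illustration, not the notebook's actual code: the function name preprocess_email, the URL and NUMBER placeholder tokens, the rule of collapsing three or more repeated characters down to two, and the order of the steps are all assumptions made for the example.

```python
import re


def preprocess_email(raw: str) -> str:
    """Normalize one raw email into a single whitespace-separated string.

    Hypothetical helper; the placeholder tokens and step order are assumptions.
    """
    # Header removal: keep only the text after the first blank line, if there is one.
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    body = parts[1] if len(parts) == 2 else raw

    text = body.lower()
    # HTML tag cleaning: replace anything that looks like <...> with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # URL generalization: every link becomes the same placeholder token.
    text = re.sub(r"(?:https?://|www\.)\S+", " URL ", text)
    # Number generalization: integers and values such as 1,000 or 3.14.
    text = re.sub(r"\d+(?:[.,]\d+)*", " NUMBER ", text)
    # Punctuation removal: keep only word characters and whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Repeated characters: collapse runs of three or more down to two.
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
    # Squeeze the whitespace introduced by the substitutions above.
    return " ".join(text.split())
```

Replacing URLs and numbers with fixed tokens keeps the vocabulary small, since the presence of a link or an amount is usually more informative for spam detection than its exact value.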

🚀 Getting Started

Prerequisites

To run this project, you will need Python 3.x and the following libraries:

pandas
numpy
scikit-learn
joblib
re (built-in Python module)

You can install them using pip:

pip install pandas numpy scikit-learn joblib

Data

This project expects email data organized into easy_ham, hard_ham, and spam subdirectories within a base directory. Due to the size and nature of email datasets, the data itself is not included in this repository. You will need to provide your own dataset.

To prepare your data:

Organize your email files into three folders: easy_ham, hard_ham, and spam.
Place these three folders inside a parent directory (e.g., Spam Filter).
Update the base_directory variable in the Spam_Filter.ipynb notebook to point to the absolute path of your parent directory.

Example Data Structure:

your_project_root/
├── Spam_Filter.ipynb
├── models/
│   ├── svm_best_model.pkl
│   └── vectorizer.pkl
├── your_data_directory/  <-- This is your base_directory
│   ├── easy_ham/
│   │   ├── email1.txt
│   │   └── email2.txt
│   ├── hard_ham/
│   │   └── email3.txt
│   └── spam/
│       ├── spam_email1.txt
│       └── spam_email2.txt
└── README.md
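
Given the layout above, loading the three folders might look like the sketch below. The function name load_emails, the label mapping (both ham folders as 0, spam as 1), and the latin-1 decoding are assumptions for illustration; the notebook may do this differently.

```python
from pathlib import Path

# Folder name -> label. Mapping both ham folders to 0 and spam to 1
# is an assumption made for this sketch.
CATEGORY_LABELS = {"easy_ham": 0, "hard_ham": 0, "spam": 1}


def load_emails(base_directory):
    """Return (texts, labels) read from the three category folders."""
    texts, labels = [], []
    for category, label in CATEGORY_LABELS.items():
        folder = Path(base_directory) / category
        if not folder.is_dir():
            raise FileNotFoundError(f"Missing category folder: {folder}")
        # Sort so the order of the returned lists is reproducible.
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # Email corpora often mix encodings; latin-1 can decode any byte sequence.
                texts.append(path.read_text(encoding="latin-1"))
                labels.append(label)
    return texts, labels
```

Calling load_emails(base_directory) with the path you set in the notebook would then yield two parallel lists ready for splitting and vectorizing.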

Running the Notebook

Clone this repository:

git clone https://github.com/salahuddin212/Spam-Classifier.git
cd Spam-Classifier

Ensure you have Jupyter Notebook installed (pip install notebook).
Launch Jupyter Notebook from the project root directory:
jupyter notebook
Open Spam_Filter.ipynb and run all cells. The notebook will guide you through:
- Loading the dataset.
- Splitting data into training and testing sets.
- Applying text preprocessing.
- Vectorizing text data.
- Training and comparing different machine learning models.
- Tuning hyperparameters for the best model.
- Evaluating the final model.
- Saving the trained model and vectorizer.
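
The steps above (split, vectorize, compare, tune, evaluate) can be condensed into a short scikit-learn sketch. It is only an outline under stated assumptions: the function name compare_and_tune, the 25% test split, the 3-fold cross-validation, and the C grid for the Linear SVM are illustrative choices, not the notebook's actual settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def compare_and_tune(texts, labels, cv=3, random_state=42):
    """Hypothetical helper: compare four classifiers, then tune a Linear SVM."""
    # Hold out a stratified test set before fitting anything.
    X_train_txt, X_test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=random_state
    )

    # Fit the vectorizer on training text only to avoid leaking test vocabulary.
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(X_train_txt)
    X_test = vectorizer.transform(X_test_txt)

    candidates = {
        "Naive Bayes": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Linear SVM": LinearSVC(),
        "Random Forest": RandomForestClassifier(random_state=random_state),
    }
    # Mean cross-validated accuracy of each candidate on the training set.
    cv_scores = {
        name: cross_val_score(model, X_train, y_train, cv=cv).mean()
        for name, model in candidates.items()
    }

    # Tune the regularization strength of the Linear SVM (grid values are assumptions).
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=cv)
    search.fit(X_train, y_train)

    # Final evaluation on the held-out test set.
    test_accuracy = search.score(X_test, y_test)
    return vectorizer, search.best_estimator_, cv_scores, test_accuracy
```

The returned vectorizer and best estimator are the two objects that would then be saved with joblib.dump for later reuse.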

⚙️ Project Structure

Spam-Classifier/
├── Spam_Filter.ipynb         # Main Jupyter Notebook with the ML pipeline
├── models/
│   ├── svm_best_model.pkl    # Trained Linear SVM model (saved after execution)
│   └── vectorizer.pkl        # Fitted CountVectorizer (saved after execution)
└── README.md                 # Project README file

💡 Areas for Future Improvement

Dataset Imbalance: Implement strategies to handle dataset imbalance (e.g., class_weight, oversampling/undersampling with imbalanced-learn).
Advanced Feature Engineering: Experiment with TfidfVectorizer, n-grams, or word embeddings for richer text representation.
Visualizations: Add data exploration and model performance visualizations (e.g., confusion matrix, ROC curves, word clouds).
Robust Model Comparison: Consider nested cross-validation for more rigorous model selection.
Modular Code: For larger projects, refactor the notebook into modular Python scripts (e.g., data_loader.py, preprocessing.py, model_training.py).
Model Loading Optimization: For production use, optimize the predict_email function to load the model and vectorizer only once.
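
As one possible take on the last item, the model and vectorizer can be cached after the first load with functools.lru_cache. This is a sketch, not the notebook's predict_email: the signature, the default paths under models/, and the assumption that label 1 means spam are all illustrative.

```python
from functools import lru_cache

import joblib


@lru_cache(maxsize=None)
def load_artifacts(model_path, vectorizer_path):
    """Load (model, vectorizer) from disk once per distinct pair of paths."""
    return joblib.load(model_path), joblib.load(vectorizer_path)


def predict_email(
    text,
    model_path="models/svm_best_model.pkl",
    vectorizer_path="models/vectorizer.pkl",
):
    """Classify one email as 'spam' or 'ham' (assumes label 1 means spam)."""
    # Repeated calls reuse the cached objects instead of reading the files again.
    model, vectorizer = load_artifacts(str(model_path), str(vectorizer_path))
    features = vectorizer.transform([text])
    return "spam" if model.predict(features)[0] == 1 else "ham"
```

In practice the text should pass through the same preprocessing used during training before it is vectorized; that step is omitted here to keep the caching idea in focus.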

🤝 Contributing

Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.

📄 License

This project is licensed under the MIT License.

📧 Contact

For any questions or feedback, please contact: https://www.linkedin.com/in/salahuddin-bayassi/
