📧 Email Spam Classifier

This repository contains a machine learning project designed to classify emails as either spam or ham (not spam). The project demonstrates a complete end-to-end pipeline for text classification, from data loading and preprocessing to model training, evaluation, and deployment.

✨ Features

  • Data Loading & Preparation: Loads email data from the easy_ham, hard_ham, and spam categories.
  • Robust Text Preprocessing: Includes comprehensive steps like header removal, HTML tag cleaning, URL/number generalization, punctuation removal, and handling of repeated characters.
  • Model Comparison: Evaluates multiple classification algorithms (Naive Bayes, Logistic Regression, Linear SVM, Random Forest) to identify the best performer.
  • Hyperparameter Tuning: Utilizes GridSearchCV to optimize model parameters for improved performance.
  • Model Persistence: Saves the trained model and vectorizer for easy reuse and deployment.
  • Prediction Function: Provides a straightforward function to classify new emails.
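The preprocessing steps listed above could be sketched as a single cleaning function. This is an illustrative sketch, not the notebook's exact implementation; the function name and placeholder tokens (URL, NUM) are assumptions:

```python
import re

def preprocess_email(raw_text: str) -> str:
    """Illustrative cleaning pipeline: header removal, HTML stripping,
    URL/number generalization, repeated-character collapsing, and
    punctuation removal (a sketch, not the notebook's exact code)."""
    # Drop the header block: everything before the first blank line.
    _, _, body = raw_text.partition("\n\n")
    text = body.lower()
    # Strip HTML tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Generalize URLs and numbers into placeholder tokens.
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)
    text = re.sub(r"\d+", " NUM ", text)
    # Collapse runs of 3+ repeated characters (e.g. "freeee" -> "free").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Remove punctuation and squeeze whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```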

🚀 Getting Started

Prerequisites

To run this project, you will need Python 3.x and the following libraries:

  • pandas
  • numpy
  • scikit-learn
  • joblib
  • re (built-in Python module)

You can install them using pip:

pip install pandas numpy scikit-learn joblib

Data

This project expects email data organized into easy_ham, hard_ham, and spam subdirectories within a base directory. Due to the size and nature of email datasets, the data itself is not included in this repository. You will need to provide your own dataset.

To prepare your data:

  1. Organize your email files into three folders: easy_ham, hard_ham, and spam.
  2. Place these three folders inside a parent directory (e.g., Spam Filter).
  3. Update the base_directory variable in the Spam_Filter.ipynb notebook to point to the absolute path of your parent directory.

Example Data Structure:

your_project_root/
├── Spam_Filter.ipynb
├── models/
│   ├── svm_best_model.pkl
│   └── vectorizer.pkl
├── your_data_directory/  <-- This is your base_directory
│   ├── easy_ham/
│   │   └── email1.txt
│   │   └── email2.txt
│   ├── hard_ham/
│   │   └── email3.txt
│   └── spam/
│       └── spam_email1.txt
│       └── spam_email2.txt
└── README.md
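Given that layout, a loader over the three subdirectories could look like the sketch below. The function name, label encoding (ham = 0, spam = 1), and encoding choice are assumptions; the notebook's actual loader may differ:

```python
from pathlib import Path
import pandas as pd

def load_emails(base_directory: str) -> pd.DataFrame:
    """Load emails from easy_ham/, hard_ham/, and spam/ into a DataFrame
    with 'text' and 'label' columns (0 = ham, 1 = spam)."""
    rows = []
    for folder, label in [("easy_ham", 0), ("hard_ham", 0), ("spam", 1)]:
        for path in sorted((Path(base_directory) / folder).iterdir()):
            if path.is_file():
                rows.append({
                    # latin-1 with errors ignored is a common choice for
                    # raw email corpora with mixed encodings.
                    "text": path.read_text(encoding="latin-1", errors="ignore"),
                    "label": label,
                })
    return pd.DataFrame(rows)
```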

Running the Notebook

  1. Clone this repository:

git clone https://github.com/salahuddin212/Spam-Classifier.git
cd Spam-Classifier

  2. Ensure you have Jupyter Notebook installed (pip install notebook).

  3. Launch Jupyter Notebook from the project root directory:
    jupyter notebook

  4. Open Spam_Filter.ipynb and run all cells. The notebook will guide you through:

    • Loading the dataset.
    • Splitting data into training and testing sets.
    • Applying text preprocessing.
    • Vectorizing text data.
    • Training and comparing different machine learning models.
    • Tuning hyperparameters for the best model.
    • Evaluating the final model.
    • Saving the trained model and vectorizer.
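The modeling steps above can be sketched end to end on a toy corpus. This is a minimal illustration under assumed defaults (the notebook's actual parameter grids, scoring, and splits may differ); the output file names follow the project structure below:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
import joblib

# Toy corpus standing in for the real dataset.
texts = ["win money now", "meeting at noon", "free prize claim", "lunch tomorrow"] * 10
labels = [1, 0, 1, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# Vectorize the text data.
vectorizer = CountVectorizer()
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)

# Compare the four candidate models on held-out accuracy.
models = {
    "naive_bayes": MultinomialNB(),
    "log_reg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(Xtr, y_train)
    print(name, model.score(Xte, y_test))

# Tune the best performer (here: the linear SVM) with GridSearchCV.
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(Xtr, y_train)

# Persist the tuned model and vectorizer for reuse.
joblib.dump(grid.best_estimator_, "svm_best_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")
```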

⚙️ Project Structure

Spam-Classifier/
├── Spam_Filter.ipynb         # Main Jupyter Notebook with the ML pipeline
├── svm_best_model.pkl        # Trained Linear SVM model (saved after execution)
├── vectorizer.pkl            # Fitted CountVectorizer (saved after execution)
└── README.md                 # Project README file

💡 Areas for Future Improvement

  • Dataset Imbalance: Implement strategies to handle dataset imbalance (e.g., class_weight, oversampling/undersampling with imbalanced-learn).
  • Advanced Feature Engineering: Experiment with TfidfVectorizer, n-grams, or word embeddings for richer text representation.
  • Visualizations: Add data exploration and model performance visualizations (e.g., confusion matrix, ROC curves, word clouds).
  • Robust Model Comparison: Consider nested cross-validation for more rigorous model selection.
  • Modular Code: For larger projects, refactor the notebook into modular Python scripts (e.g., data_loader.py, preprocessing.py, model_training.py).
  • Model Loading Optimization: For production use, optimize the predict_email function to load the model and vectorizer only once.
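The last point could be addressed with module-level lazy loading. The sketch below assumes the .pkl file names from the project structure and a predict_email signature; the caching pattern is the suggested improvement, not the notebook's current code:

```python
import joblib

_model = None
_vectorizer = None

def predict_email(text: str) -> str:
    """Classify one email as 'spam' or 'ham', loading the model and
    vectorizer from disk only on the first call (illustrative sketch)."""
    global _model, _vectorizer
    if _model is None:
        _model = joblib.load("svm_best_model.pkl")
        _vectorizer = joblib.load("vectorizer.pkl")
    features = _vectorizer.transform([text])
    return "spam" if _model.predict(features)[0] == 1 else "ham"
```

Subsequent calls reuse the cached objects, so per-prediction latency drops to a single vectorize-and-predict step.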

🤝 Contributing

Contributions are welcome! If you have suggestions for improvements or new features, please open an issue or submit a pull request.

📄 License

This project is licensed under the MIT License.

📧 Contact

For questions or feedback, reach out on LinkedIn: https://www.linkedin.com/in/salahuddin-bayassi/
