🌟 A Comprehensive Text Classification Pipeline for Email Subject Categorization Using NLP and Naive Bayes 🌟
- Introduction
- Features
- System Requirements
- Installation
- Getting Started
- Usage
- Contributing
- Security
- Code of Conduct
- Support
- License
- Acknowledgements
NLP-Email-Categorizer is an open-source project that provides a robust text classification pipeline for categorizing email subjects using Natural Language Processing (NLP) techniques and the Multinomial Naive Bayes algorithm. The repository includes two Jupyter notebooks:
- Naive_Bayes_Text_Classification.ipynb: A comprehensive pipeline with advanced features like hyperparameter tuning, cross-validation, data augmentation, and an interactive GUI for predictions.
- Text_Classification_Pipeline_for_Email_Subjects.ipynb: A simplified pipeline focused on core classification steps for beginners or quick deployment.
Designed for data scientists, NLP enthusiasts, and developers, this project demonstrates how to preprocess text data, extract features, train a classifier, evaluate performance, and deploy a model for real-world email categorization tasks (e.g., spam detection, priority sorting). The notebooks leverage popular Python libraries like scikit-learn, nltk, pandas, and joblib, with clear logging and visualizations for transparency.
Note: This project is not actively maintained. The notebooks assume a dataset with email subjects and categories, which is not included. Users must provide their own dataset. Contributions to enhance functionality or documentation are welcome!
- Text Preprocessing:
- Converts text to lowercase, removes punctuation, tokenizes words, and eliminates stopwords using nltk.
- Handles missing values to ensure robust data preparation.
- Feature Extraction:
- Uses CountVectorizer to transform text into numerical features based on word frequencies.
- Model Training:
- Implements Multinomial Naive Bayes for efficient text classification.
- Includes hyperparameter tuning via GridSearchCV (advanced notebook).
- Model Evaluation:
- Computes accuracy, precision, recall, and F1-score using classification_report.
- Visualizes performance with confusion matrix heatmaps (advanced notebook).
- Performs cross-validation to assess model robustness (advanced notebook).
- Model Persistence:
- Saves trained models and vectorizers using joblib, zipped for portability.
- Supports loading saved models for predictions on new data.
- Interactive Prediction:
- Provides a GUI for real-time predictions using ipywidgets (advanced notebook).
- Allows single-text predictions via preprocessed input (simplified notebook).
- Data Augmentation:
- Optional text augmentation using WordNet synonyms to enhance training data (advanced notebook).
- Visualization:
- Displays category distribution with bar plots.
- Generates confusion matrix heatmaps for intuitive performance analysis.
- Logging:
- Implements detailed logging for debugging and process tracking.
- Modularity:
- Two notebooks cater to different user needs: advanced for experts, simplified for beginners.
- Portability:
- Runs in Jupyter environments (e.g., Google Colab, local Jupyter Notebook) with minimal dependencies.
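The text preprocessing steps listed above (lowercasing, punctuation removal, tokenization, stopword removal, missing-value handling) can be sketched roughly as follows. This is a minimal illustration, not the notebooks' code: the `preprocess` function and the tiny `STOPWORDS` set are hypothetical stand-ins so the example runs without downloading NLTK data, whereas the notebooks rely on nltk for tokenization and stopwords.

```python
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words("english"),
# kept tiny so the sketch runs without downloading NLTK data.
STOPWORDS = {"a", "an", "the", "at", "is", "to", "of", "and", "for", "in", "on"}


def preprocess(subject):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    if subject is None:  # treat a missing value as an empty subject
        return ""
    text = str(subject).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace split; the notebooks tokenize with nltk
    return " ".join(token for token in tokens if token not in STOPWORDS)


print(preprocess("Meeting Tomorrow!"))  # -> meeting tomorrow
```

Swapping `STOPWORDS` for the NLTK English stopword list should only change which tokens are dropped, not the overall flow.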
To run NLP-Email-Categorizer, ensure you have:
- Operating System: Windows, macOS, Linux, or any system supporting Python.
- Python Version: Python 3.6 or higher.
- Jupyter Environment: Jupyter Notebook, JupyterLab, or Google Colab.
- Disk Space: ~50 MB for notebooks, dependencies, and saved models (dataset size varies).
- Dependencies (listed in notebooks):
- pandas, numpy: Data manipulation and analysis.
- scikit-learn: Machine learning and feature extraction.
- nltk: Text preprocessing (stopwords, tokenization, WordNet).
- matplotlib, seaborn: Visualizations.
- joblib: Model persistence.
- ipywidgets: Interactive GUI (advanced notebook).
- zipfile: Model zipping (part of the Python standard library, so no installation is needed).
- Dataset: A CSV or TSV file with at least two columns: Subject (email subject text) and Category (label, e.g., spam, ham).
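Since the dataset is not bundled with the project, it may help to check a file against the two-column requirement before running a notebook. The sketch below is one possible way to do that with pandas; the `load_subjects` helper and the in-memory sample are hypothetical and not taken from the notebooks.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["Subject", "Category"]


def load_subjects(source, sep=","):
    """Read a delimited file and keep only rows with both required columns filled."""
    data = pd.read_csv(source, sep=sep)
    missing = [column for column in REQUIRED_COLUMNS if column not in data.columns]
    if missing:
        raise ValueError(f"Dataset is missing required columns: {missing}")
    return data.dropna(subset=REQUIRED_COLUMNS).reset_index(drop=True)


# In-memory sample standing in for a real file such as email_dataset.csv.
sample = io.StringIO(
    "Subject,Category\n"
    '"Meeting tomorrow at 10 AM",Meeting\n'
    '"Win a free iPhone now!",Spam\n'
    ",Spam\n"  # row with a missing subject, dropped by load_subjects
)
data = load_subjects(sample)
print(data)
```

For a TSV file, the same helper could be called with `sep="\t"`.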
Follow these steps to set up NLP-Email-Categorizer locally:
- Clone the Repository:
    git clone https://github.com/VoxDroid/NLP-Email-Categorizer.git
- Navigate to the Project Directory:
    cd NLP-Email-Categorizer
- Create a Virtual Environment (recommended):
    python -m venv venv
    source venv/bin/activate  # Linux/macOS
    venv\Scripts\activate     # Windows
- Install Dependencies:
    pip install pandas numpy scikit-learn nltk matplotlib seaborn joblib ipywidgets
- Install NLTK Data: Run the following in a Python shell or notebook cell:
    import nltk
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')
- Prepare Your Dataset:
- Place your dataset (e.g., email_dataset.csv or your_tsv_file_here.tsv) in the project directory.
- Ensure it has Subject and Category columns. Example format:
    Subject,Category
    "Meeting tomorrow at 10 AM",Meeting
    "Win a free iPhone now!",Spam
- Launch Jupyter Notebook:
    jupyter notebook
  Open Naive_Bayes_Text_Classification.ipynb or Text_Classification_Pipeline_for_Email_Subjects.ipynb in your browser.
Note: If using Google Colab, upload the notebooks and dataset to Colab, install dependencies via !pip install, and update file paths as needed.
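After installing, a quick check that every dependency is importable can save a failed notebook run. The following is a small sketch using only the standard library; `REQUIRED_MODULES` and `find_missing` are illustrative names, and the list simply mirrors the packages from the install step.

```python
import importlib.util

# Import names for the packages listed in the install step.
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "pandas", "numpy", "sklearn", "nltk",
    "matplotlib", "seaborn", "joblib", "ipywidgets",
]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages (by import name):", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only confirms that the modules can be located; it does not verify versions or that the NLTK data has been downloaded.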
To start using NLP-Email-Categorizer:
- Open a Notebook:
- Choose Naive_Bayes_Text_Classification.ipynb for advanced features (e.g., GUI, augmentation).
- Choose Text_Classification_Pipeline_for_Email_Subjects.ipynb for a simpler pipeline.
- Configure the Dataset:
- Update the file_name parameter in the first cell to your dataset’s path (e.g., email_dataset.csv).
- Ensure the file is in the correct format (CSV for advanced notebook, TSV for simplified).
- Run the Notebook:
- Execute cells sequentially using Shift+Enter.
- Monitor logging messages for progress (e.g., data loading, preprocessing).
- Check visualizations (e.g., category distribution, confusion matrix).
- Test Functionality:
- Verify data loading and preprocessing by inspecting the output of data.head() and data['Subject'].head().
- Evaluate model performance via accuracy and classification reports.
- Test predictions using the GUI (advanced notebook) or manual input (simplified notebook).
- Experiment with data augmentation by modifying the example text in the advanced notebook.
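To give a sense of what the evaluation output in the last step looks like, here is a short sketch using scikit-learn's metrics functions. The `y_true` and `y_pred` lists are hypothetical labels standing in for the test labels and the model's predictions; they are not results from the notebooks.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, standing in for y_test
# and model.predict(X_test) in a real run.
y_true = ["Spam", "Meeting", "Spam", "Meeting", "Spam", "Meeting"]
y_pred = ["Spam", "Meeting", "Meeting", "Meeting", "Spam", "Spam"]

labels = ["Meeting", "Spam"]
accuracy = accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(f"Accuracy: {accuracy:.2f}")
print(classification_report(y_true, y_pred, labels=labels))
print(matrix)  # rows are true labels, columns are predicted labels
```

The confusion matrix array can be passed to a plotting call such as `seaborn.heatmap(matrix, annot=True)` to produce a heatmap similar in spirit to the one in the advanced notebook.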
Advanced notebook (Naive_Bayes_Text_Classification.ipynb):
- Data Preparation:
- Loads a CSV dataset and visualizes category distribution.
- Example: Bar plot showing counts of categories (e.g., Spam, Meeting).
- Preprocessing:
- Applies lowercase conversion, punctuation removal, tokenization, and stopword removal to email subjects.
- Training:
- Splits data (80% train, 20% test), vectorizes text, and trains a Naive Bayes model with hyperparameter tuning (alpha).
- Evaluation:
- Outputs accuracy, classification report, confusion matrix heatmap, and cross-validation scores.
- Example: Confusion matrix highlights misclassifications (e.g., Spam vs. Ham).
- Prediction:
- Use the GUI to input an email subject and predict its category.
- Example: Input “Urgent meeting today” → Predicts “Meeting”.
- Augmentation:
- Generates synonym-based variations of text (e.g., “important meeting” → “crucial gathering”).
- Model Saving:
- Saves model and vectorizer to a zip file (model_and_vectorizer.zip).
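The training and cross-validation steps of the advanced notebook could look roughly like the sketch below. Several things here are assumptions for illustration: the toy subjects and labels, the alpha grid, the use of a scikit-learn `Pipeline` to tie the vectorizer to the classifier, the stratified split, and the small fold counts chosen so the example runs on twelve samples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; a real run would use the Subject and Category columns.
subjects = [
    "win free iphone now", "free offer claim prize", "cheap loans win cash",
    "claim free cash prize", "win cheap offer now", "free prize offer today",
    "meeting tomorrow morning", "project meeting agenda", "team meeting rescheduled",
    "agenda for project review", "review meeting tomorrow", "team project agenda",
]
categories = ["Spam"] * 6 + ["Meeting"] * 6

# 80% train, 20% test; stratifying keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    subjects, categories, test_size=0.2, stratify=categories, random_state=42
)

# Vectorizer and classifier in one pipeline, so alpha is tuned on raw text.
pipeline = Pipeline([("vec", CountVectorizer()), ("nb", MultinomialNB())])
alphas = [0.1, 0.5, 1.0]  # assumed grid of smoothing values
search = GridSearchCV(pipeline, {"nb__alpha": alphas}, cv=2)
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["nb__alpha"])
print("Test accuracy:", search.score(X_test, y_test))
print("CV scores:", cross_val_score(search.best_estimator_, subjects, categories, cv=3))
```

On a real dataset, a larger fold count and a wider alpha grid would likely be more informative than the values used here.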
Simplified notebook (Text_Classification_Pipeline_for_Email_Subjects.ipynb):
- Data Preparation:
- Loads a TSV dataset and displays the first few rows.
- Preprocessing:
- Similar preprocessing steps as the advanced notebook.
- Training:
- Trains a Naive Bayes model without hyperparameter tuning for simplicity.
- Evaluation:
- Outputs accuracy and classification report.
- Prediction:
- Predicts the category for a single user-provided email subject.
- Example: Input “Free offer now” → Predicts “Spam”.
- Model Saving:
- Saves the model to a zip file (model.zip).
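Both notebooks finish by saving their output to a zip file. A minimal sketch of that round trip with joblib and zipfile is shown below. The toy model, the temporary directory, and the `.joblib` file names inside the archive are assumptions; the notebooks may name and organize these files differently.

```python
import os
import tempfile
import zipfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy model so the round trip has something to save.
subjects = ["win free prize now", "free cash offer",
            "team meeting tomorrow", "project meeting agenda"]
categories = ["Spam", "Spam", "Meeting", "Meeting"]
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(subjects), categories)

workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "model_and_vectorizer.zip")

# Dump each object with joblib, then bundle both files into one archive.
# The .joblib file names are assumptions; the notebooks may use other names.
with zipfile.ZipFile(zip_path, "w") as archive:
    for name, obj in [("model.joblib", model), ("vectorizer.joblib", vectorizer)]:
        file_path = os.path.join(workdir, name)
        joblib.dump(obj, file_path)
        archive.write(file_path, arcname=name)

# Later: extract the archive and load both objects back.
restore_dir = os.path.join(workdir, "restored")
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(restore_dir)
loaded_model = joblib.load(os.path.join(restore_dir, "model.joblib"))
loaded_vectorizer = joblib.load(os.path.join(restore_dir, "vectorizer.joblib"))

print(loaded_model.predict(loaded_vectorizer.transform(["free prize offer"])))
```

The vectorizer has to be saved alongside the model because new text must be mapped to the same vocabulary the model was trained on. Files loaded with joblib are unpickled, so only archives from a trusted source should be loaded.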
Example workflow:
- Load dataset (email_dataset.csv).
- Run preprocessing to clean subjects (e.g., “Meeting Tomorrow!” → “meeting tomorrow”).
- Train the model and check its accuracy (~85% is an illustrative figure; the actual value depends on the dataset).
- Use the GUI to predict: “Project deadline extended” → “Positive”.
- Save the model and reuse it for new predictions.
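The workflow above can be condensed into a small end-to-end sketch: clean the subjects, fit the vectorizer and classifier, then predict the category of one new subject. The `clean` and `predict_category` helpers and the four training rows are hypothetical, and stopword removal is left out to keep the example short.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def clean(subject):
    """Lowercase and strip punctuation (stopword removal omitted for brevity)."""
    return subject.lower().translate(str.maketrans("", "", string.punctuation))


def predict_category(subject, vectorizer, model):
    """Clean one subject, vectorize it with the fitted vectorizer, and predict."""
    features = vectorizer.transform([clean(subject)])
    return model.predict(features)[0]


# Hypothetical training rows in the Subject/Category format.
train_subjects = [
    "Win a free iPhone now!",
    "Free offer, claim your prize",
    "Meeting tomorrow at 10 AM",
    "Urgent: project meeting today",
]
train_categories = ["Spam", "Spam", "Meeting", "Meeting"]

vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform([clean(s) for s in train_subjects]), train_categories)

print(predict_category("Urgent meeting today", vectorizer, model))  # -> Meeting
```

The same cleaning function is applied at training and prediction time, which keeps new subjects consistent with the vocabulary the vectorizer learned.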
We welcome contributions to NLP-Email-Categorizer! To get involved:
- Review the Contributing Guidelines for details on submitting issues, feature requests, or pull requests.
- Fork the repository, make changes, and submit a pull request.
- Adhere to the Code of Conduct to ensure a respectful community.
Example contributions:
- Add support for other classifiers (e.g., Logistic Regression, SVM).
- Enhance preprocessing with lemmatization or stemming.
- Improve the GUI with additional features (e.g., batch predictions).
- Create a sample dataset for testing.
Security is a priority for NLP-Email-Categorizer. If you discover a vulnerability:
- Report it privately as outlined in the Security Policy.
- Avoid public disclosure until the issue is resolved.
All contributors and users are expected to follow the Code of Conduct to maintain a welcoming and inclusive environment.
Need help with NLP-Email-Categorizer? Visit the Support page for resources, including:
- Filing bug reports or feature requests.
- Community discussions and contact information.
- FAQs for common issues (e.g., dataset formatting, dependency errors).
NLP-Email-Categorizer is licensed under the MIT License. See the LICENSE file for details.
- VoxDroid: For creating and maintaining the project.
- scikit-learn: For robust machine learning tools.
- NLTK: For powerful NLP preprocessing capabilities.
- Contributors: Thanks to all who report issues, suggest features, or contribute code.
- NLP Community: For inspiring accessible and practical text classification solutions.