📧 NLP-Email-Categorizer

🌟 A Comprehensive Text Classification Pipeline for Email Subject Categorization Using NLP and Naive Bayes 🌟

Introduction

NLP-Email-Categorizer is an open-source project that provides a robust text classification pipeline for categorizing email subjects using Natural Language Processing (NLP) techniques and the Multinomial Naive Bayes algorithm. The repository includes two Jupyter notebooks:

- Naive_Bayes_Text_Classification.ipynb: A comprehensive pipeline with advanced features like hyperparameter tuning, cross-validation, data augmentation, and an interactive GUI for predictions.
- Text_Classification_Pipeline_for_Email_Subjects.ipynb: A simplified pipeline focused on core classification steps for beginners or quick deployment.

Designed for data scientists, NLP enthusiasts, and developers, this project demonstrates how to preprocess text data, extract features, train a classifier, evaluate performance, and deploy a model for real-world email categorization tasks (e.g., spam detection, priority sorting). The notebooks leverage popular Python libraries like scikit-learn, nltk, pandas, and joblib, with clear logging and visualizations for transparency.

Note: This project is not actively maintained. The notebooks assume a dataset with email subjects and categories, which is not included. Users must provide their own dataset. Contributions to enhance functionality or documentation are welcome!

Features

Text Preprocessing:
- Converts text to lowercase, removes punctuation, tokenizes words, and eliminates stopwords using nltk.
- Handles missing values to ensure robust data preparation.
Feature Extraction:
- Uses CountVectorizer to transform text into numerical features based on word frequencies.
Model Training:
- Implements Multinomial Naive Bayes for efficient text classification.
- Includes hyperparameter tuning via GridSearchCV (advanced notebook).
Model Evaluation:
- Computes accuracy, precision, recall, and F1-score using classification_report.
- Visualizes performance with confusion matrix heatmaps (advanced notebook).
- Performs cross-validation to assess model robustness (advanced notebook).
Model Persistence:
- Saves trained models and vectorizers using joblib, zipped for portability.
- Supports loading saved models for predictions on new data.
Interactive Prediction:
- Provides a GUI for real-time predictions using ipywidgets (advanced notebook).
- Allows single-text predictions via preprocessed input (simplified notebook).
Data Augmentation:
- Optional text augmentation using WordNet synonyms to enhance training data (advanced notebook).
Visualization:
- Displays category distribution with bar plots.
- Generates confusion matrix heatmaps for intuitive performance analysis.
Logging:
- Implements detailed logging for debugging and process tracking.
Modularity:
- Two notebooks cater to different user needs: advanced for experts, simplified for beginners.
Portability:
- Runs in Jupyter environments (e.g., Google Colab, local Jupyter Notebook) with minimal dependencies.

System Requirements

To run NLP-Email-Categorizer, ensure you have:

Operating System: Windows, macOS, Linux, or any system supporting Python.
Python Version: Python 3.6 or higher.
Jupyter Environment: Jupyter Notebook, JupyterLab, or Google Colab.
Disk Space: ~50 MB for notebooks, dependencies, and saved models (dataset size varies).
Dependencies (listed in notebooks):
- pandas, numpy: Data manipulation and analysis.
- scikit-learn: Machine learning and feature extraction.
- nltk: Text preprocessing (stopwords, tokenization, WordNet).
- matplotlib, seaborn: Visualizations.
- joblib: Model persistence.
- ipywidgets: Interactive GUI (advanced notebook).
- zipfile: Model zipping (part of the Python standard library, so no installation is needed).
Dataset: A CSV or TSV file with at least two columns: Subject (email subject text) and Category (label, e.g., spam, ham).
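
The dataset requirement above can be checked before running either notebook. The following is a minimal sketch, not code from the notebooks: the helper name `load_email_dataset` is hypothetical, and the in-memory sample only stands in for a real file.

```python
import io

import pandas as pd

# Column names taken from the dataset requirement above.
REQUIRED_COLUMNS = ("Subject", "Category")


def load_email_dataset(source, sep=","):
    """Load a delimited file and keep only rows usable for training.

    `load_email_dataset` is a hypothetical helper written for this sketch.
    """
    data = pd.read_csv(source, sep=sep)
    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f"Dataset is missing required column(s): {missing}")
    # Drop rows where either the subject or the label is absent.
    return data.dropna(subset=list(REQUIRED_COLUMNS)).reset_index(drop=True)


# Illustrative in-memory CSV; replace with a real path such as "email_dataset.csv".
sample_csv = io.StringIO(
    "Subject,Category\n"
    '"Meeting tomorrow at 10 AM",Meeting\n'
    '"Win a free iPhone now!",Spam\n'
    ",Spam\n"  # row with a missing subject, dropped by the helper
)
emails = load_email_dataset(sample_csv)
print(emails)
```

For a TSV file, pass `sep="\t"` to the same helper.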

Installation

Follow these steps to set up NLP-Email-Categorizer locally:

Clone the Repository:

```
git clone https://github.com/VoxDroid/NLP-Email-Categorizer.git
```

Navigate to the Project Directory:
```
cd NLP-Email-Categorizer
```

Create a Virtual Environment (recommended):

```
python -m venv venv
source venv/bin/activate  # Linux/macOS
venv\Scripts\activate     # Windows
```

Install Dependencies:

```
pip install pandas numpy scikit-learn nltk matplotlib seaborn joblib ipywidgets
```

Install NLTK Data: Run the following in a Python shell or notebook cell:

```
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
```

Prepare Your Dataset:
- Place your dataset (e.g., email_dataset.csv or your_tsv_file_here.tsv) in the project directory.
- Ensure it has Subject and Category columns. Example format:
```
Subject,Category
"Meeting tomorrow at 10 AM",Meeting
"Win a free iPhone now!",Spam
```
Launch Jupyter Notebook:
```
jupyter notebook
```
Open Naive_Bayes_Text_Classification.ipynb or Text_Classification_Pipeline_for_Email_Subjects.ipynb in your browser.

Note: If using Google Colab, upload the notebooks and dataset to Colab, install dependencies via !pip install, and update file paths as needed.

Getting Started

To start using NLP-Email-Categorizer:

Open a Notebook:
- Choose Naive_Bayes_Text_Classification.ipynb for advanced features (e.g., GUI, augmentation).
- Choose Text_Classification_Pipeline_for_Email_Subjects.ipynb for a simpler pipeline.
Configure the Dataset:
- Update the file_name parameter in the first cell to your dataset’s path (e.g., email_dataset.csv).
- Ensure the file is in the correct format (CSV for advanced notebook, TSV for simplified).
Run the Notebook:
- Execute cells sequentially using Shift+Enter.
- Monitor logging messages for progress (e.g., data loading, preprocessing).
- Check visualizations (e.g., category distribution, confusion matrix).
Test Functionality:
- Verify data loading and preprocessing by inspecting the output of data.head() and data['Subject'].head().
- Evaluate model performance via accuracy and classification reports.
- Test predictions using the GUI (advanced notebook) or manual input (simplified notebook).
- Experiment with data augmentation by modifying the example text in the advanced notebook.

Usage

Running the Advanced Notebook (`Naive_Bayes_Text_Classification.ipynb`)

Data Preparation:
- Loads a CSV dataset and visualizes category distribution.
- Example: Bar plot showing counts of categories (e.g., Spam, Meeting).
Preprocessing:
- Applies lowercase conversion, punctuation removal, tokenization, and stopword removal to email subjects.
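
The preprocessing steps above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: it tokenizes on whitespace instead of using NLTK's tokenizer, the function name `preprocess_subject` is hypothetical, and a small fallback stopword list is used when the NLTK stopword corpus is not installed.

```python
import string

# Small fallback stopword list (an assumption for this sketch). The notebooks use
# NLTK's English stopwords, which are tried first when the corpus is available.
FALLBACK_STOPWORDS = {"a", "an", "the", "at", "is", "are", "to", "of", "in", "on", "for", "and"}

try:
    from nltk.corpus import stopwords

    STOPWORDS = set(stopwords.words("english"))
except (ImportError, LookupError):
    STOPWORDS = FALLBACK_STOPWORDS

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_subject(text):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords."""
    if not isinstance(text, str):
        return ""  # treat missing values (None, NaN) as empty text
    cleaned = text.lower().translate(_PUNCT_TABLE)
    tokens = [tok for tok in cleaned.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(preprocess_subject("Meeting Tomorrow!"))  # meeting tomorrow
```

Applied to a DataFrame, this would look like `data["Subject"].apply(preprocess_subject)`.
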
Training:
- Splits data (80% train, 20% test), vectorizes text, and trains a Naive Bayes model with hyperparameter tuning (alpha).
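
A sketch of the training step, assuming a scikit-learn `Pipeline` that wraps `CountVectorizer` and `MultinomialNB`. The toy subjects, the alpha grid, and the fold count are invented for illustration; the notebook's actual values may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative dataset (invented for this sketch, not from the project).
spam = [
    "Win a free iPhone now", "Claim your free prize today",
    "Free offer just for you", "Exclusive deal win cash now",
    "You won a free vacation", "Limited offer claim your reward",
    "Win cash prizes instantly", "Free gift card offer inside",
    "Act now to claim free money", "Congratulations you won a prize",
]
meeting = [
    "Meeting tomorrow at 10 AM", "Project meeting rescheduled to Friday",
    "Agenda for the team meeting", "Urgent meeting today",
    "Weekly sync meeting notes", "Please confirm meeting attendance",
    "Budget review meeting on Monday", "Team meeting room changed",
    "Reminder meeting starts at noon", "Minutes from the planning meeting",
]
subjects = spam + meeting
labels = ["Spam"] * len(spam) + ["Meeting"] * len(meeting)

# 80% train, 20% test, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    subjects, labels, test_size=0.2, stratify=labels, random_state=42
)

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Search over the smoothing parameter alpha (grid values are assumptions).
search = GridSearchCV(
    pipeline,
    param_grid={"classifier__alpha": [0.1, 0.5, 1.0]},
    cv=4,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["classifier__alpha"])
print("Held-out accuracy:", search.score(X_test, y_test))
```

Putting the vectorizer inside the pipeline means its vocabulary is refit on each training fold, so the validation folds do not leak into feature extraction.
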
Evaluation:
- Outputs accuracy, classification report, confusion matrix heatmap, and cross-validation scores.
- Example: Confusion matrix highlights misclassifications (e.g., Spam vs. Ham).
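
The evaluation outputs listed above could be produced along these lines. This is a hedged sketch on invented toy data; it prints the confusion matrix as an array instead of drawing the heatmap, and the split and fold settings are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data invented for this sketch: 10 subjects per category.
spam = [
    "win free prize now", "free offer claim reward", "claim cash prize today",
    "free gift offer inside", "win cash now", "exclusive free deal",
    "you won free vacation", "limited offer free money",
    "free reward inside", "claim your prize now",
]
meeting = [
    "meeting tomorrow morning", "team meeting agenda", "project meeting rescheduled",
    "urgent meeting today", "weekly meeting notes", "budget review meeting",
    "confirm meeting attendance", "meeting room changed",
    "planning meeting minutes", "reminder meeting at noon",
]
subjects = spam + meeting
labels = ["Spam"] * len(spam) + ["Meeting"] * len(meeting)

X_train, X_test, y_train, y_test = train_test_split(
    subjects, labels, test_size=0.2, stratify=labels, random_state=0
)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, and F1 on the held-out 20%.
print(classification_report(y_test, y_pred, zero_division=0))

# Rows are true labels, columns are predicted labels, in the order of class_names.
class_names = ["Meeting", "Spam"]
matrix = confusion_matrix(y_test, y_pred, labels=class_names)
print(matrix)

# 5-fold cross-validation on the full dataset; the pipeline refits per fold.
cv_scores = cross_val_score(model, subjects, labels, cv=5)
print("CV accuracy: %.2f +/- %.2f" % (cv_scores.mean(), cv_scores.std()))
```

To get a heatmap like the one the notebook describes, `matrix` can be passed to `seaborn.heatmap` with `annot=True` and `class_names` as tick labels.
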
Prediction:
- Use the GUI to input an email subject and predict its category.
- Example: Input “Urgent meeting today” → Predicts “Meeting”.
Augmentation:
- Generates synonym-based variations of text (e.g., “important meeting” → “crucial gathering”).
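
The idea behind synonym-based augmentation can be sketched without WordNet. In the example below, the `SYNONYMS` table and the function `augment_text` are hypothetical stand-ins; the notebook is described as drawing synonyms from WordNet, which could fill such a table via `nltk.corpus.wordnet.synsets(word)` and each synset's `lemma_names()`.

```python
from itertools import islice, product

# Hypothetical synonym table standing in for WordNet lookups (an assumption).
SYNONYMS = {
    "important": ["crucial"],
    "meeting": ["gathering"],
}


def augment_text(text, synonyms=None, max_variants=5):
    """Return up to max_variants rewrites of text built by swapping in synonyms."""
    table = SYNONYMS if synonyms is None else synonyms
    words = text.split()
    # For every position, the candidates are the original word, then its synonyms.
    options = [[word] + list(table.get(word.lower(), [])) for word in words]
    combos = product(*options)
    next(combos, None)  # the first combination is always the unchanged text
    return [" ".join(combo) for combo in islice(combos, max_variants)]


print(augment_text("important meeting"))
# ['important gathering', 'crucial meeting', 'crucial gathering']
```

Each variant keeps the label of the original subject, so the augmented rows can be appended to the training set only (never to the test set).
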
Model Saving:
- Saves model and vectorizer to a zip file (model_and_vectorizer.zip).
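
Saving and restoring the model and vectorizer might look like the sketch below. The archive name matches the one mentioned above; the inner file names (`model.joblib`, `vectorizer.joblib`) and the toy training data are assumptions made for this example.

```python
import os
import tempfile
import zipfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Minimal fitted objects so the example runs on its own (toy data, invented here).
subjects = ["free prize now", "win free cash", "team meeting today", "meeting agenda attached"]
labels = ["Spam", "Spam", "Meeting", "Meeting"]
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(subjects), labels)

with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical file names for the two serialized objects.
    model_path = os.path.join(workdir, "model.joblib")
    vectorizer_path = os.path.join(workdir, "vectorizer.joblib")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Bundle both artifacts into a single archive for portability.
    archive_path = os.path.join(workdir, "model_and_vectorizer.zip")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(model_path, arcname="model.joblib")
        archive.write(vectorizer_path, arcname="vectorizer.joblib")

    # Later: unpack the archive and restore both objects.
    restore_dir = os.path.join(workdir, "restored")
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(restore_dir)
    loaded_model = joblib.load(os.path.join(restore_dir, "model.joblib"))
    loaded_vectorizer = joblib.load(os.path.join(restore_dir, "vectorizer.joblib"))

prediction = loaded_model.predict(loaded_vectorizer.transform(["free cash prize"]))[0]
print(prediction)  # Spam
```

The vectorizer has to be saved alongside the model because new text must be mapped onto the same vocabulary the classifier was trained on.
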

Running the Simplified Notebook (`Text_Classification_Pipeline_for_Email_Subjects.ipynb`)

Data Preparation:
- Loads a TSV dataset and displays the first few rows.
Preprocessing:
- Similar preprocessing steps as the advanced notebook.
Training:
- Trains a Naive Bayes model without hyperparameter tuning for simplicity.
Evaluation:
- Outputs accuracy and classification report.
Prediction:
- Predicts the category for a single user-provided email subject.
- Example: Input “Free offer now” → Predicts “Spam”.
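
A single-subject prediction without hyperparameter tuning can be sketched as below. The training subjects are invented for illustration and `predict_category` is a hypothetical helper, not a function from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (invented for this sketch).
train_subjects = [
    "free offer inside",
    "win a free prize now",
    "claim your free reward",
    "meeting tomorrow at 10 am",
    "team meeting agenda",
    "project meeting rescheduled",
]
train_labels = ["Spam", "Spam", "Spam", "Meeting", "Meeting", "Meeting"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(train_subjects)

# Default smoothing (alpha=1.0); no grid search in the simplified flow.
classifier = MultinomialNB()
classifier.fit(features, train_labels)


def predict_category(subject):
    """Vectorize one subject with the fitted vocabulary and return its label."""
    return classifier.predict(vectorizer.transform([subject]))[0]


print(predict_category("Free offer now"))  # Spam
```

Note that `transform` (not `fit_transform`) is used at prediction time, so words unseen during training are simply ignored.
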
Model Saving:
- Saves the model to a zip file (model.zip).

Example Workflow

Load dataset (email_dataset.csv).
Run preprocessing to clean subjects (e.g., “Meeting Tomorrow!” → “meeting tomorrow”).
Train the model and achieve ~85% accuracy (dataset-dependent).
Use the GUI to predict: “Project deadline extended” → “Positive” (the predicted label is always one of the categories present in your dataset).
Save the model and reuse it for new predictions.

Contributing

We welcome contributions to NLP-Email-Categorizer! To get involved:

Review the Contributing Guidelines for details on submitting issues, feature requests, or pull requests.
Fork the repository, make changes, and submit a pull request.
Adhere to the Code of Conduct to ensure a respectful community.

Example contributions:

Add support for other classifiers (e.g., Logistic Regression, SVM).
Enhance preprocessing with lemmatization or stemming.
Improve the GUI with additional features (e.g., batch predictions).
Create a sample dataset for testing.

Security

Security is a priority for NLP-Email-Categorizer. If you discover a vulnerability:

Report it privately as outlined in the Security Policy.
Avoid public disclosure until the issue is resolved.

Code of Conduct

All contributors and users are expected to follow the Code of Conduct to maintain a welcoming and inclusive environment.

Support

Need help with NLP-Email-Categorizer? Visit the Support page for resources, including:

Filing bug reports or feature requests.
Community discussions and contact information.
FAQs for common issues (e.g., dataset formatting, dependency errors).

License

NLP-Email-Categorizer is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

VoxDroid: For creating and maintaining the project.
scikit-learn: For robust machine learning tools.
NLTK: For powerful NLP preprocessing capabilities.
Contributors: Thanks to all who report issues, suggest features, or contribute code.
NLP Community: For inspiring accessible and practical text classification solutions.

Developed by VoxDroid

Enjoying NLP-Email-Categorizer? Star the project on GitHub!
