Email Spam Detection

This project is an email spam detection model built with Python and machine learning. It classifies emails as spam or not spam based on their content. The final model is a Random Forest Classifier, selected for its high accuracy in distinguishing spam from non-spam emails.

Project Overview

This project follows a typical machine learning workflow, including data preprocessing, balancing, text processing, feature extraction, model training, evaluation, and saving the final model.

Key Steps:

Data Preprocessing:
- Loaded the spam dataset.
- Cleaned the data by removing unnecessary columns, handling missing values, and removing duplicates.
- Encoded target labels for binary classification (spam and non-spam).
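
The cleaning steps above could look roughly like the following with pandas. This is a minimal sketch, not the notebook's actual code: the column names (`label`, `text`, `unused`), the label strings (`ham`, `spam`), and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the spam dataset; column names are assumptions.
raw = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham", None],
    "text": ["see you at 5", "WIN a free prize now", "see you at 5", "lunch?", "orphan row"],
    "unused": [None, None, None, None, None],
})


def clean(df):
    """Drop unneeded columns, missing values, and duplicates, then encode labels."""
    df = df.drop(columns=["unused"])              # remove unnecessary columns
    df = df.dropna(subset=["label", "text"])      # drop rows with missing values
    df = df.drop_duplicates()                     # remove exact duplicate rows
    df = df.assign(label=df["label"].map({"ham": 0, "spam": 1}))  # 1 = spam
    return df.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```
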
Balancing the Dataset:
- The dataset was imbalanced, with more non-spam than spam messages.
- Used Random Over Sampling, which duplicates randomly chosen spam messages until both classes are the same size, so the model is not biased toward the majority (non-spam) class.
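
To illustrate the idea behind random over sampling, here is a small standard-library sketch. It is not the project's implementation (a library such as imbalanced-learn provides `RandomOverSampler` for this); the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=42):
    """Duplicate randomly chosen minority-class samples until all classes match the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, count in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        # Sample with replacement from this class to fill the gap (no-op for the majority class).
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_texts.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_texts, out_labels


# Toy example: four non-spam (0) messages and one spam (1) message.
texts = ["hi", "lunch?", "see you", "report attached", "free prize now"]
labels = [0, 0, 0, 0, 1]
balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```
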
Text Processing:
- Preprocessed email content by removing punctuation, numbers, and stopwords, and then lemmatizing the text.
- Added a new feature representing the length of each email for further analysis.
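
The text cleaning described above can be sketched as follows. To keep the example runnable without downloading NLTK data, it uses a tiny illustrative stopword set and an identity function in place of a lemmatizer; both are assumptions. In the project, NLTK's English stopword list and `WordNetLemmatizer().lemmatize` would be passed in instead.

```python
import re

# Tiny illustrative stopword set (an assumption); the project relies on NLTK's stopwords.
STOPWORDS = {"a", "an", "the", "is", "to", "you", "and", "of", "for"}


def preprocess(text, stopwords=STOPWORDS, lemmatize=lambda word: word):
    """Lowercase, strip punctuation and digits, drop stopwords, then lemmatize each token."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep only letters and whitespace
    tokens = [lemmatize(w) for w in text.split() if w not in stopwords]
    return " ".join(tokens)


raw = "WIN a FREE prize!!! Call 0800 now."
cleaned_text = preprocess(raw)
length = len(raw)   # length feature, measured here on the raw message (an assumption)
print(cleaned_text, length)
```
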
Visualization:
- Plotted the distribution of email lengths for spam and non-spam emails.
- Created word clouds to visualize commonly occurring words in both spam and non-spam emails.
Feature Extraction:
- Used TF-IDF Vectorization to transform the email text into numerical features, limiting the vocabulary to 3000 terms.
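
A minimal sketch of this step with scikit-learn's `TfidfVectorizer` follows. The three documents are made up, and using `max_features=3000` to apply the limit is an assumption about how the project does it. In scikit-learn, `max_features` keeps the terms with the highest frequency across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-preprocessed documents.
docs = [
    "win free prize call now",
    "see you at lunch",
    "free prize waiting call now",
]

# Cap the vocabulary at 3000 terms; this tiny corpus has far fewer, so nothing is cut.
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(docs)          # sparse matrix: one row per document
print(X.shape)

# A smaller cap shows the effect: only 5 term columns remain.
small = TfidfVectorizer(max_features=5).fit_transform(docs)
print(small.shape)
```
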
Model Training:
- Tried multiple machine learning algorithms: Logistic Regression, Naive Bayes, SVM, and Random Forest Classifier.
- Each model was evaluated based on accuracy, precision, recall, F1 score, and confusion matrix.
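
The comparison above might be organized like this with scikit-learn. It is a sketch under stated assumptions: the toy corpus, the hyperparameters, the linear SVM kernel, and the 75/25 split are all illustrative, and the scores it prints say nothing about the project's real results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus (1 = spam, 0 = not spam); not the project's data.
spam_words = ["prize", "cash", "offer", "winner", "bonus"]
ham_words = ["meeting", "lunch", "report", "project", "schedule"]
texts = ([f"claim your free {w} now" for w in spam_words] * 2
         + [f"notes from the {w} today" for w in ham_words] * 2)
labels = [1] * 10 + [0] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, clf in models.items():
    # Fit the vectorizer inside the pipeline so it only sees training data.
    pipe = make_pipeline(TfidfVectorizer(max_features=3000), clf)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, pred, labels=[0, 1]),
    }

for name, scores in results.items():
    print(name, {k: v for k, v in scores.items() if k != "confusion_matrix"})
```
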
Final Model Selection:
- Chose the Random Forest Classifier as the final model because it scored best on the evaluation metrics.
Saving the Model:
- Saved the trained model as a pickle file (spam_detector_model.pkl) for future use.

Installation

Clone the repository:

git clone https://github.com/your-username/Email-Spam-Detection.git
cd Email-Spam-Detection

Install required packages:
```
pip install -r requirements.txt
```
Download NLTK resources (stopwords and wordnet) if not already installed:
```
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
```

Usage

Training the Model: Run the notebook or script to preprocess the data, train the model, and save the final classifier.
Prediction: Load spam_detector_model.pkl and use it to classify new email content as spam or not spam.
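
A hedged sketch of the prediction step is shown here. It assumes the pickle contains a full scikit-learn pipeline (vectorizer plus classifier) and that spam is encoded as 1. If the file holds only the classifier, the fitted TF-IDF vectorizer has to be loaded and applied to the text first. The helper names `load_model` and `classify` are hypothetical, and the example writes a stand-in model to a temporary directory so that it runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def load_model(path):
    """Load a pickled model; only unpickle files from a source you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def classify(model, texts):
    """Map the model's numeric predictions to readable labels (1 = spam is an assumption)."""
    return ["spam" if p == 1 else "not spam" for p in model.predict(texts)]


# Stand-in for the real trained model, fitted on made-up data.
toy = make_pipeline(TfidfVectorizer(),
                    RandomForestClassifier(n_estimators=20, random_state=0))
toy.fit(["claim your free prize now", "free cash offer now",
         "notes from the meeting", "lunch at noon today"], [1, 1, 0, 0])

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "spam_detector_model.pkl"
    with open(model_path, "wb") as fh:
        pickle.dump(toy, fh)
    model = load_model(model_path)
    predictions = classify(model, ["free prize offer", "meeting notes"])

print(predictions)
```

With the real project, `load_model("spam_detector_model.pkl")` would replace the temporary-file setup.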

Evaluation

The model was evaluated using the following metrics:

Accuracy: The fraction of all emails that are classified correctly.
Precision: The fraction of emails predicted as spam that are actually spam.
Recall: The fraction of actual spam emails that the model catches.
F1 Score: The harmonic mean of precision and recall, which balances the two.
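
For reference, the four metrics can be computed directly from confusion-matrix counts, treating spam as the positive class. The counts in the example are hypothetical and are not results from this project.

```python
def spam_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp: spam predicted as spam        fp: non-spam predicted as spam
    fn: spam predicted as non-spam    tn: non-spam predicted as non-spam
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 1000 emails.
metrics = spam_metrics(tp=40, fp=10, fn=10, tn=940)
print(metrics)
```
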

The Random Forest Classifier gave the best results on these metrics, with the fewest misclassifications among the models tried.
