Spam Email Classification

Introduction

In today's digital world, spam emails are a constant nuisance. They clog inboxes, waste time, and can pose security threats when they carry phishing attempts or malware. This document details the development of a machine learning model that tackles this problem by automatically classifying incoming emails as spam or legitimate (ham).

Project Description

This project is a machine learning model that classifies emails as spam or non-spam, built with Python and popular libraries. The work involved preprocessing email text data, extracting relevant features, and training classification algorithms to achieve high accuracy in spam detection.

Data Overview

The dataset has five columns:
Index: A unique identifier for each email.
#Sent Emails: How many times the email has been sent.
Label: Whether the email is spam or legitimate.
Text: The content of the email.
Binary Label: The same label encoded as 0 or 1.
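
To make the column layout concrete, here is a minimal sketch of reading such a file and checking the class balance. The column names (`index`, `sent_emails`, `label`, `text`, `label_num`), the sample rows, and the mapping 1 = spam are all assumptions for illustration; the real dataset may name and encode them differently.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file and its exact column names may differ.
# The mapping 1 = spam, 0 = ham is an assumption, not stated in the source.
SAMPLE_CSV = """index,sent_emails,label,text,label_num
0,3,ham,"Subject: meeting notes attached",0
1,12,spam,"Subject: win a free prize now",1
2,1,ham,"Subject: lunch tomorrow?",0
"""

def load_rows(csv_file):
    """Read rows into dicts and cast the numeric columns to int."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["index"] = int(row["index"])
        row["sent_emails"] = int(row["sent_emails"])
        row["label_num"] = int(row["label_num"])
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))

# Count how many emails fall into each class.
class_counts = Counter(row["label"] for row in rows)
print(class_counts)
```

Checking the class balance early is useful because accuracy alone can be misleading when one class dominates.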

Key Achievements

Data Preprocessing: Cleaned and prepared raw email data, including tokenization, stop-word removal, and feature extraction using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).
Model Training: Implemented and evaluated various classification algorithms such as Naive Bayes, Support Vector Machines (SVM), and Logistic Regression to determine the most effective model for spam detection.
Data Exploration and Visualization: Produced charts to better understand the data:
1. Histogram
2. Pie chart
3. Word cloud for spam emails
4. Word cloud for ham emails
5. Top 30 words in spam emails
6. Top 30 words in ham emails
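
The tokenization, stop-word removal, and top-word counts described above can be sketched with the standard library alone. This is an illustrative approximation, not the project's actual preprocessing: the stop-word list, the regular expression, and the sample texts are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real project would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "and", "of", "in", "is", "for", "you", "your"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def top_words(texts, n=30):
    """Return the n most frequent tokens across all texts as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)

# Hypothetical spam samples for demonstration.
spam_texts = [
    "Win a free prize now",
    "Claim your free prize today",
    "Free offer for you",
]
print(top_words(spam_texts, n=3))
```

Running the same function separately on the spam and ham subsets gives the two "top 30 words" lists, which can then be plotted as bar charts or fed to a word-cloud tool.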

Model Comparison:

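Five algorithms were tested to see which works best. SVM reached the highest testing accuracy (97.29%), while Decision Tree and KNN showed the largest gaps between training and testing accuracy: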

Naïve Bayes

Training Accuracy: 94.00%
Testing Accuracy: 93.33%
Confusion Matrix: (figure omitted)

Logistic Regression

Training Accuracy: 97.00%
Testing Accuracy: 96.91%
Confusion Matrix: (figure omitted)

SVM

Training Accuracy: 97.34%
Testing Accuracy: 97.29%
Confusion Matrix: (figure omitted)

Decision Tree

Training Accuracy: 99.85%
Testing Accuracy: 95.60%
Confusion Matrix: (figure omitted)

KNN

Training Accuracy: 95.14%
Testing Accuracy: 90.62%
Confusion Matrix: (figure omitted)
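
A comparison of this kind could be reproduced along the following lines. This is a minimal sketch, assuming scikit-learn and a toy dataset. The project's actual hyperparameters, SVM kernel, and split ratio are not stated, so `LinearSVC`, `test_size=0.25`, `n_neighbors=3`, and the mapping 1 = spam are illustrative choices only. Accuracies obtained on the toy data are not comparable to the figures reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy data for illustration only; labels assume 1 = spam, 0 = ham.
texts = [
    "win a free prize now", "claim your free cash today",
    "cheap pills limited offer", "click here for free money",
    "exclusive offer win cash", "urgent claim your prize",
    "meeting agenda for monday", "project report attached",
    "lunch with the team tomorrow", "please review the budget draft",
    "schedule update for the workshop", "notes from the client call",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit TF-IDF on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    results[name] = {
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, test_pred),
        # Rows are true classes, columns are predicted classes, ordered [ham, spam].
        "confusion": confusion_matrix(y_test, test_pred, labels=[0, 1]),
    }

for name, scores in results.items():
    print(f"{name}: train={scores['train_acc']:.2%} test={scores['test_acc']:.2%}")
```

Reporting training and testing accuracy side by side makes overfitting visible: a model that scores much higher on the training split than on the test split has likely memorized the training data.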

Deployment: Created a simple user interface where users can enter an email and have it classified in real time.
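
The classification step behind such an interface might look like the sketch below. It assumes the fitted vectorizer and model are serialized with `pickle` and reloaded at startup. To stay self-contained it trains a toy model and round-trips it through in-memory bytes instead of files. The function name `classify_email`, the training texts, and the mapping 1 = spam are hypothetical, and the interface framework itself is not shown because the source does not name it.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data for illustration; labels assume 1 = spam, 0 = ham.
train_texts = [
    "win free prize now",
    "free cash offer click now",
    "claim free prize today",
    "meeting agenda for monday",
    "project report attached",
    "lunch with the team tomorrow",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Stand-in for writing and reading .pkl files: serialize to bytes and load back.
loaded_vectorizer = pickle.loads(pickle.dumps(vectorizer))
loaded_model = pickle.loads(pickle.dumps(model))

def classify_email(text, vectorizer, model):
    """Vectorize one email with the fitted vectorizer and return 'spam' or 'ham'."""
    features = vectorizer.transform([text])
    prediction = model.predict(features)[0]
    return "spam" if prediction == 1 else "ham"

print(classify_email("claim your free cash prize", loaded_vectorizer, loaded_model))
print(classify_email("team meeting report", loaded_vectorizer, loaded_model))
```

The vectorizer must be the same fitted object used during training, which is why it is saved alongside the model. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.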

Contributors

Zeina Wady

Sara Darwish

Ruba AbdELSalam

Basmala Ayman

Sara Habib

Bassant Ahmed

zeinawady/Spam_Email_Classification

Folders and files

Latest commit

History

Repository files navigation

Spam Email Classification

Introduction

Project Description

Key Achievements

Model Comparison:

Contributors

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages