A machine learning project that automatically identifies spam and toxic comments from YouTube using natural language processing and pre-trained transformer models.
This project analyzes YouTube comments to classify them as spam or legitimate content by combining:
- Toxicity detection using a pre-trained transformer model
- Pattern matching for common spam indicators
- Text preprocessing and cleaning techniques
- Automated Text Cleaning: Removes URLs, mentions, hashtags, and special characters
- Dual Classification Approach:
  - Toxicity scoring using `martin-ha/toxic-comment-model`
  - Pattern-based spam detection
- Comprehensive Analysis: Statistical reporting and visualization
- Flexible Thresholds: Customizable spam detection sensitivity
```bash
pip install transformers torch torchvision
pip install scikit-learn pandas numpy matplotlib seaborn
pip install nltk wordcloud textblob
```
- Clone this repository
- Install the required packages
- Prepare your YouTube comments dataset in CSV format
- Run the Jupyter notebook
Ensure your CSV file contains a `CONTENT` column with the comment text:
```csv
CONTENT,category
"Great video! Thanks for sharing",ham
"Subscribe to my channel! Check out www.spam.com",spam
```
```python
import pandas as pd
from transformers import pipeline

# Load and preprocess data
df = pd.read_csv('Youtube-Spam-Dataset.csv')
df['CLEAN_CONTENT'] = df['CONTENT'].apply(clean_text)

# Initialize the classifier
classifier = pipeline('text-classification', model='martin-ha/toxic-comment-model')

# Generate spam labels
df['CLASS_LABEL'] = create_spam_labels(df)
```
The classifier outputs:
- `CLASS_LABEL`: Binary classification (0 = Clean, 1 = Spam)
- Statistical summary of spam detection rates
- Visualization charts showing comment distribution
The spam detection algorithm considers a comment as spam if:
- High Pattern Match: 2+ spam patterns detected
- Medium Pattern + Toxicity: 1+ patterns AND toxicity score > 0.3
- High Toxicity: Toxicity score > 0.7
Detected spam patterns include:
- Keywords: `subscribe`, `channel`, `check out`, `follow me`, `my channel`, `visit`, `website`
- URLs (`.com`, `www`)
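Putting the rules together, the decision logic might look like the following minimal sketch. The pattern list and the function name `label_comment` are illustrative, not the project's exact code:

```python
import re

# Illustrative subset of the project's spam patterns
SPAM_PATTERNS = [r'subscribe', r'channel', r'check out', r'www\.']

def label_comment(text: str, toxicity: float) -> int:
    """Return 1 (spam) or 0 (clean) using the three threshold rules."""
    matches = sum(bool(re.search(p, text.lower())) for p in SPAM_PATTERNS)
    if matches >= 2:                     # High pattern match
        return 1
    if matches >= 1 and toxicity > 0.3:  # Medium pattern + toxicity
        return 1
    if toxicity > 0.7:                   # High toxicity
        return 1
    return 0
```

Note that the three rules are checked in order of precedence; a comment with two or more pattern hits is flagged regardless of its toxicity score.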
The `clean_text()` function performs:
- Converts text to lowercase
- Removes URLs (`http://`, `https://`, `www.`)
- Removes mentions (`@username`) and hashtags (`#tag`)
- Strips special characters and extra whitespace
- Handles missing/null values
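The steps above can be sketched with standard-library regexes; this is an approximation of the behavior described, not the project's exact implementation:

```python
import re

def clean_text(text) -> str:
    """Lowercase, strip URLs/mentions/hashtags/special chars, collapse whitespace."""
    if not isinstance(text, str):  # handle missing/NaN values
        return ""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # URLs
    text = re.sub(r'[@#]\w+', ' ', text)                # mentions and hashtags
    text = re.sub(r'[^a-z0-9\s]', ' ', text)            # special characters
    return re.sub(r'\s+', ' ', text).strip()            # extra whitespace
```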
The classifier uses the pre-trained `martin-ha/toxic-comment-model`, which provides:
- Toxicity probability scores
- Binary toxic/non-toxic classification
- Robust performance on social media content
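A Hugging Face `text-classification` pipeline returns a list of `{'label', 'score'}` dicts. A small helper like the one below converts that into a single toxicity probability; it assumes the model's labels are `toxic`/`non-toxic`, which you should verify against the model card:

```python
def toxicity_score(result) -> float:
    """Map a pipeline output like [{'label': 'toxic', 'score': 0.92}]
    to a toxicity probability in [0, 1]."""
    entry = result[0]
    if entry['label'].lower() == 'toxic':
        return entry['score']
    return 1.0 - entry['score']  # 'non-toxic' -> complement
```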
The project generates:
- Pie Chart: Distribution of clean vs spam comments
- Statistical Summary: Total counts and spam percentage
- Color-coded Results: Green for clean, red for spam
Modify the thresholds in `create_spam_labels()`:

```python
# More sensitive detection
if pattern_matches >= 1:  # Lower threshold
    spam_label = 1
elif toxicity > 0.5:  # Lower toxicity threshold
    spam_label = 1
```
Extend the spam patterns list:

```python
spam_patterns = [
    r'subscribe', r'channel', r'check out',
    r'like and subscribe',  # Add custom patterns
    r'hit the bell',
    r'free download'
]
```
- Fork the repository
- Create a feature branch (`git checkout -b feature/improvement`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/improvement`)
- Create a Pull Request
This project is open source and available under the MIT License.
- `martin-ha/toxic-comment-model` for the pre-trained toxicity classifier
- Hugging Face Transformers for the pipeline infrastructure
- YouTube Spam Dataset contributors
If you encounter any issues or have questions:
- Check the Issues section
- Review the code comments and documentation
- Submit a new issue with detailed information
Note: This classifier is designed for educational and research purposes. Always review automated classifications and consider implementing human moderation for production systems.