🎯 YouTube Toxic Comment Classification Using BERT

🚀 Overview

This project is designed to classify YouTube comments as toxic or non-toxic using BERT (Bidirectional Encoder Representations from Transformers). By fine-tuning a pre-trained BERT model, we leverage state-of-the-art NLP capabilities to identify harmful content in online conversations.

The model is trained to identify toxicity, which is crucial for creating safer and more respectful online platforms. The project can be extended to handle multiple levels of toxicity or integrated into moderation tools.


🔎 Demo

Here's a quick demo of the toxicity classification tool:

Toxicity Prediction Demo


🗂️ Table of Contents

  1. Dataset
  2. Installation
  3. Data Preprocessing
  4. Model Architecture
  5. Training
  6. Evaluation
  7. Results
  8. Usage
  9. Future Improvements
  10. License

📊 Dataset

The dataset consists of YouTube comments with labels indicating whether each comment is toxic (1) or non-toxic (0). This binary classification task is aimed at improving content moderation in online discussions. A minimal loading sketch follows the column list.

  • Columns:
    • comment_id: Unique identifier for each comment.
    • content: Text of the comment.
    • label: 1 for toxic comments, 0 for non-toxic comments.
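
As an illustration, here is a minimal sketch of loading and inspecting the dataset with pandas. The filename comments.csv and the use of pandas are assumptions, not details taken from the repository:

```python
import pandas as pd

# Hypothetical filename -- replace with the actual dataset file in the repo.
df = pd.read_csv("comments.csv")

# Expected columns: comment_id, content, label (1 = toxic, 0 = non-toxic).
print(df[["comment_id", "content", "label"]].head())
print(df["label"].value_counts())  # check class balance before training
```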

💻 Installation

To get started with this project, follow these instructions:

Prerequisites

  • Python 3.7+
  • PyTorch 1.6+
  • Hugging Face Transformers library
  • CUDA-enabled GPU (optional but recommended)

Setup

  1. Clone the Repository:
    git clone https://github.com/your-repository/youtube-toxic-comment-classification.git
    cd youtube-toxic-comment-classification
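
Once the prerequisites are installed, a quick sanity check (a sketch, not part of the repository) can confirm that PyTorch and Transformers are available and whether a CUDA GPU will be used:

```python
import torch
import transformers

print("PyTorch version:     ", torch.__version__)
print("Transformers version:", transformers.__version__)
print("CUDA available:      ", torch.cuda.is_available())  # GPU is optional but speeds up fine-tuning
```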

🛠️ Data Preprocessing

The raw comments are preprocessed before being fed into the BERT model (a sketch of these steps follows the list). Preprocessing includes:

  • Removing URLs, special characters, and extra spaces.
  • Converting all text to lowercase.
  • Tokenizing with the BERT tokenizer, which converts the cleaned text into input tokens compatible with BERT.
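
A minimal sketch of these steps, assuming the bert-base-uncased tokenizer and a maximum sequence length of 128 (both are assumptions, not settings confirmed by the repository):

```python
import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def clean_comment(text: str) -> str:
    """Remove URLs, special characters, and extra spaces, then lowercase."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)     # strip special characters
    text = re.sub(r"\s+", " ", text).strip()        # collapse extra whitespace
    return text.lower()

cleaned = clean_comment("Check this out!!! https://example.com   SO   rude...")

# Convert the cleaned text into BERT-compatible input IDs and attention masks.
encoded = tokenizer(
    cleaned,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 128])
```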

🧠 Model Architecture

This project uses BERT (Bidirectional Encoder Representations from Transformers) to handle the NLP task of toxic comment detection. BERT is a transformer-based model that understands the context of words in sentences, making it highly effective for text classification tasks.

Key Components:

Tokenizer: Converts sentences into token IDs that BERT can consume.

BERT for Sequence Classification: A pre-trained BERT model fine-tuned for binary classification.
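
A minimal sketch of wiring these two components together with Hugging Face Transformers. The bert-base-uncased checkpoint is an assumption, and the freshly initialized classification head produces meaningless predictions until it has been fine-tuned on the dataset described above:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed base checkpoint

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)  # 0 = non-toxic, 1 = toxic
model.eval()

# Classify a single comment (only meaningful after fine-tuning).
inputs = tokenizer("example comment text", return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1).item()
print("toxic" if prediction == 1 else "non-toxic")
```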

📊 Results

After fine-tuning BERT, we achieved the following performance metrics (a sketch of how they can be computed appears after the list):

1. Accuracy: ~90%

2. F1 Score: ~0.85

3. Precision: ~0.88

4. Recall: ~0.83
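
For reference, metrics like these can be computed from test-set labels and model predictions with scikit-learn; the arrays below are placeholders, not the project's actual outputs:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Placeholder labels and predictions -- substitute the real test-split results.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```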

📈 Future Improvements

1. Multi-label Classification: Extend the model to classify different types of toxicity (e.g., hate speech, threats).

2. Data Augmentation: Generate synthetic examples to address class imbalance.

3. Model Optimization: Experiment with lighter models such as DistilBERT for faster inference.

4. Multi-language Support: Expand the model to detect toxicity in additional languages.

📜 License

This project is licensed under the MIT License. See the LICENSE file for more details.
