Toxic Comment Classification Challenge

Overview

This project is a solution to the Jigsaw Toxic Comment Classification Challenge, hosted on Kaggle. The goal is to develop a model that can classify online comments into different categories of toxicity.

Dataset

The dataset consists of comments labeled for six overlapping types of toxicity:

  • Toxic
  • Severe Toxic
  • Obscene
  • Threat
  • Insult
  • Identity Hate

The dataset is available on Kaggle and includes both training and test sets.
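
For orientation, the snippet below loads the competition files with pandas and inspects the six label columns. The data/ paths are an assumption (adjust them to wherever you unzipped the files); the column names match the ones used in the Kaggle files:

import pandas as pd

# Standard competition files; adjust the paths to wherever you unzipped them.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

# The six binary label columns described above.
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

print(train.shape, test.shape)   # dataset sizes
print(train[label_cols].sum())   # positive examples per label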

Methodology

The project follows these steps:

  1. Data Exploration:
    • Checking dataset structure
    • Visualizing word frequency and toxicity distribution
  2. Preprocessing:
    • Text cleaning (removing punctuation, stopwords, and special characters)
    • Tokenization and stemming/lemmatization
    • Converting text to numerical representations using TF-IDF or embeddings
  3. Modeling:
    • Training machine learning models such as Logistic Regression and Naive Bayes, or deep learning models such as LSTMs and Transformers (e.g., BERT)
    • Hyperparameter tuning and cross-validation
  4. Evaluation:
    • Performance assessment using metrics such as AUC-ROC, accuracy, precision, recall, and F1-score
    • Confusion matrices and classification reports
  5. Prediction & Submission:
    • Making predictions on the test dataset
    • Formatting and submitting results to Kaggle (a baseline sketch covering steps 2-5 follows this list)
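
To make steps 2-5 concrete, here is a minimal baseline sketch with scikit-learn: TF-IDF features feed one Logistic Regression per label, scored with cross-validated ROC AUC, and the test probabilities are written in the submission format. The file paths are assumptions, and the notebook's actual models may differ from this baseline:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def build_pipeline():
    # TF-IDF over word unigrams and bigrams feeding a linear classifier.
    return make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2), stop_words="english"),
        LogisticRegression(max_iter=1000),
    )

# Cross-validated ROC AUC per label (the competition scores mean column-wise AUC).
for label in label_cols:
    scores = cross_val_score(build_pipeline(), train["comment_text"], train[label],
                             cv=3, scoring="roc_auc")
    print(f"{label}: AUC = {scores.mean():.3f}")

# Fit one classifier per label on the full training set, then collect test probabilities.
submission = pd.DataFrame({"id": test["id"]})
for label in label_cols:
    model = build_pipeline().fit(train["comment_text"], train[label])
    submission[label] = model.predict_proba(test["comment_text"])[:, 1]

submission.to_csv("submission.csv", index=False)

Training one binary classifier per label is the simplest way to handle the multi-label setup; it also makes per-label AUC easy to report.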

Requirements

To run this project, install the following dependencies:

pip install numpy pandas matplotlib seaborn scikit-learn tensorflow torch transformers

Usage

To execute the notebook:

  1. Download the dataset from Kaggle.
  2. Run the notebook step by step.
  3. Modify hyperparameters and try different models for better performance.
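
If you have the official Kaggle CLI installed and configured with an API token, step 1 can be done from the command line (the competition slug and zip name below follow Kaggle's conventions; adjust if they differ):

kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
unzip jigsaw-toxic-comment-classification-challenge.zip -d data/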

Results & Findings

  • Toxicity classes are highly imbalanced: most comments are not toxic, and categories such as threat and identity hate have very few positive examples (quantified in the snippet below).
  • Many comments contain misspellings, non-standard grammar, and slang, which makes careful preprocessing especially important.
  • A single comment can belong to several categories, so this is a multi-label classification problem.
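
The imbalance is easy to quantify from the label columns (reusing train and label_cols from the earlier snippets):

# Fraction of comments flagged per label; threat and identity_hate are the rarest.
print(train[label_cols].mean().sort_values())

# Labels per comment: most comments have zero, but some carry several at once.
print(train[label_cols].sum(axis=1).value_counts().sort_index())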

Future Improvements

  • Experiment with different architectures such as Transformer-based models.
  • Improve text preprocessing techniques.
  • Perform data augmentation to balance the dataset.

Acknowledgments

Thanks to Kaggle and the Jigsaw team for providing this challenge and dataset. This project was developed using open-source libraries like Scikit-learn, TensorFlow, and PyTorch.

Author

Adilet Akimshe
