This project is a solution to the Jigsaw Toxic Comment Classification Challenge, hosted on Kaggle. The goal is to develop a model that can classify online comments into different categories of toxicity.
The dataset consists of comments labeled for different types of toxicity, including:
- Toxic
- Severe Toxic
- Obscene
- Threat
- Insult
- Identity Hate
The dataset is available on Kaggle and includes both training and test sets.
The project follows these steps:
-
Data Exploration:
- Checking dataset structure
- Visualizing word frequency and toxicity distribution
-
Preprocessing:
- Text cleaning (removing punctuation, stopwords, and special characters)
- Tokenization and stemming/lemmatization
- Converting text to numerical representations using TF-IDF or embeddings
-
Modeling:
- Training machine learning models such as Logistic Regression, Naive Bayes, or deep learning models like LSTMs and Transformers (e.g., BERT)
- Hyperparameter tuning and cross-validation
-
Evaluation:
- Performance assessment using metrics like AUC-ROC, accuracy, precision, recall, and F1-score
- Confusion matrices and classification reports
-
Prediction & Submission:
- Making predictions on the test dataset
- Formatting and submitting results to Kaggle
To run this project, install the following dependencies:
pip install numpy pandas matplotlib seaborn scikit-learn tensorflow torch transformers
To execute the notebook:
- Download the dataset from Kaggle.
- Run the notebook step by step.
- Modify hyperparameters and try different models for better performance.
- Toxicity classes are highly imbalanced, most comments are not toxic, and some categories like threat and identity hate have very few positive examples.
- There are comments with misspellings, non-standard grammar, and slang that make preprocessing more crucial
- Single comment can belong to several categories.
- Experiment with different architectures such as Transformer-based models.
- Improve text preprocessing techniques.
- Perform data augmentation to balance the dataset.
Thanks to Kaggle and the Jigsaw team for providing this challenge and dataset. This project was developed using open-source libraries like Scikit-learn, TensorFlow, and PyTorch.
[Adilet Akimshe]