This project addresses the pervasive issue of toxic commentary online, which poses a significant threat to the integrity of digital discourse. Toxic comments can make platforms feel unsafe, obstruct productive exchanges, and inflict serious psychological harm on their targets. The objective of this project is to detect and categorize such toxic comments, thereby contributing to a safer and more respectful online community.
Toxic comments span a spectrum of derogatory, insulting, and outright harassing language, so detecting and classifying them requires a nuanced approach. This project employs a multichannel Convolutional Neural Network (CNN) architecture, which is well suited to identifying complex patterns in text. By processing several representations of the text input in parallel, the multichannel CNN increases the model's sensitivity to the varied linguistic features of toxic comments. The model achieves 95% accuracy on the training data and 97% on the validation data.
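The multichannel design described above can be sketched in Keras roughly as follows; the kernel sizes, filter counts, and other hyperparameters are illustrative assumptions, not the project's exact configuration.

```python
# Hypothetical sketch of a multichannel CNN for multi-label toxic
# comment classification. Hyperparameters are illustrative only.
from tensorflow.keras.layers import (Concatenate, Conv1D, Dense, Dropout,
                                     Embedding, GlobalMaxPooling1D, Input)
from tensorflow.keras.models import Model


def build_multichannel_cnn(vocab_size=20000, seq_len=200,
                           embed_dim=100, num_labels=6):
    """Three parallel Conv1D channels, one per n-gram width."""
    inp = Input(shape=(seq_len,))
    emb = Embedding(vocab_size, embed_dim)(inp)
    channels = []
    for kernel_size in (3, 4, 5):  # each channel sees a different n-gram size
        conv = Conv1D(128, kernel_size, activation="relu")(emb)
        channels.append(GlobalMaxPooling1D()(conv))
    merged = Concatenate()(channels)
    x = Dropout(0.5)(merged)
    x = Dense(64, activation="relu")(x)
    # sigmoid (not softmax): each toxicity label is predicted independently
    out = Dense(num_labels, activation="sigmoid")(x)
    return Model(inp, out)
```

The sigmoid output layer reflects that the Jigsaw task is multi-label: a single comment can be, for example, both `toxic` and `insult` at the same time.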
The dataset leveraged for this project is derived from the Jigsaw Toxic Comment Classification Challenge on Kaggle, featuring a comprehensive set of comments from Wikipedia discussions tagged for various degrees of toxicity.
For textual analysis, the project uses 100-dimensional GloVe word embeddings, which provide dense vector representations of words and their contextual meanings. These embeddings enable the neural network to interpret the textual data effectively. The GloVe embeddings can be downloaded from the link in the table below.
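As an illustration of how such embeddings are typically wired into a model, the sketch below parses a GloVe-format text file and fills an embedding matrix indexed by the tokenizer's word index; the function names and the zero rows for out-of-vocabulary words are assumptions, not the project's exact code.

```python
# Illustrative helpers for loading GloVe vectors into an embedding
# matrix. Names and the zero-init OOV policy are assumptions.
import numpy as np


def load_glove(path, encoding="utf-8"):
    """Parse a GloVe text file ("word v1 v2 ...") into {word: vector}."""
    vectors = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


def build_embedding_matrix(word_index, vectors, dim=100):
    """Row i holds the GloVe vector for the word with tokenizer index i;
    words missing from GloVe keep an all-zero row."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix
```

The resulting matrix is typically passed to the `Embedding` layer as initial weights, optionally frozen so the pretrained vectors are not updated during training.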
Each component of the project plays a critical role in the overall functionality:
- Streamlit App: The web application interface where users interact with the model to make predictions.
- Model: The trained multichannel CNN model file used for inference.
- Tokenizer Pickle: The serialized tokenizer mapping the model uses to convert raw text into integer sequences it can process.
- YouTube Comments: A dataset of YouTube comments, used to test the effectiveness of the trained model in the deployed application.
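A minimal sketch of how these artifacts might fit together at prediction time; the file names, sequence length, and helper function are assumptions, while the six label names come from the Jigsaw challenge.

```python
# Hypothetical glue code joining the tokenizer pickle and the saved
# CNN model. File names and seq_len are assumptions.
import pickle

from tensorflow.keras.models import load_model
from tensorflow.keras.utils import pad_sequences

# The six labels defined by the Jigsaw Toxic Comment challenge.
LABELS = ["toxic", "severe_toxic", "obscene", "threat",
          "insult", "identity_hate"]


def predict_toxicity(texts, tokenizer, model, seq_len=200):
    """Tokenize raw comments, pad them, and return per-label probabilities."""
    seqs = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(seqs, maxlen=seq_len)
    probs = model.predict(padded)
    return [dict(zip(LABELS, row.tolist())) for row in probs]

# Typical wiring in the Streamlit app (file names are assumptions):
# with open("tokenizer.pickle", "rb") as f:
#     tokenizer = pickle.load(f)
# model = load_model("cnn_model.h5")
# predict_toxicity(["example comment"], tokenizer, model)
```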
| Description | Link |
|---|---|
| Dataset | Kaggle Challenge |
| Word Embeddings | GloVe 100D |
| Streamlit App | Streamlit Interface |
| Model | CNN Model |
| Tokenizer Pickle | Tokenizer |
| Streamlit Deploy | Streamlit Deploy App |
| YouTube Comments | Download from the file above |
- Robby Hidayah Ramadhan - 120450033
- Muhammad Aqsal Fadillah - 120450077
- Nadhira Adela Putri - 12045001
- Wulan Ayu Windari - 120450045
Deep Learning Project [RA] - 2023