Skip to content

This project addresses the pervasive issue of toxic commentary online, which poses a significant threat to the integrity of digital discourse. Toxic comments can create environments of insecurity, obstruct productive exchanges, and lead to severe psychological effects on the recipients. The objective of this project is to detect and categorize such

Notifications You must be signed in to change notification settings

sains-data/Toxic_Comment_Binary_Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Multichannel CNN for Toxic Comment Classification

This project addresses the pervasive issue of toxic commentary online, which poses a significant threat to the integrity of digital discourse. Toxic comments can create environments of insecurity, obstruct productive exchanges, and lead to severe psychological effects on the recipients. The objective of this project is to detect and categorize such toxic comments, thereby contributing to the cultivation of a secure and respectful online community.

Project Overview

The multifaceted nature of toxic comments, which includes a spectrum of derogatory, insulting, or outright harassing language, requires a nuanced approach for detection and classification. This project employs a multichannel Convolutional Neural Network (CNN) architecture, which is particularly adept at identifying complex patterns within text data. By processing various representations of text input in parallel, the multichannel CNN enhances the model's sensitivity to the intricate features of language used in toxic comments. The model has demonstrated excellent performance, achieving an accuracy of 95% on the training data and 97% on the validation data.

Dataset

The dataset leveraged for this project is derived from the Jigsaw Toxic Comment Classification Challenge on Kaggle, featuring a comprehensive set of comments from Wikipedia discussions tagged for various degrees of toxicity.

Word Embeddings

For textual analysis, the project utilizes 100-dimensional GloVe word embeddings, which provide a dense representation of words and their contextual meanings. These embeddings play a pivotal role in enabling the neural network to interpret and evaluate the textual data effectively. Access the GloVe embeddings from the table below.

How to Use

Each component of the project plays a critical role in the overall functionality:

  • Streamlit App: This is the web application interface where users can interact with the model for making predictions.
  • Model: The trained multichannel CNN model file which is used for prediction.
  • Tokenizer Pickle: This contains the tokenization mapping used by the model to convert text data into a format that can be processed.
  • Youtube Comment : This dataset is taken from YouTube comments. It is used with the purpose of testing the effectiveness level in the application of the machine learning model that has been built

Useful Links Table

Description Link
Dataset Kaggle Challenge
Word Embeddings GloVe 100D
Streamlit App Streamlit Interface
Model CNN Model
Tokenizer Pickle Tokenizer
Streamlit Deploy Streamlit Deploy App
YoutubeComment You Can Dowload in File Above

Contributors

  1. Robby Hidayah Ramadhan - 120450033
  2. Muhammad Aqsal Fadillah - 120450077
  3. Nadhira Adela Putri - 12045001
  4. Wulan Ayu Windari - 120450045

License

Deep Learning Project [RA] - 2023

About

This project addresses the pervasive issue of toxic commentary online, which poses a significant threat to the integrity of digital discourse. Toxic comments can create environments of insecurity, obstruct productive exchanges, and lead to severe psychological effects on the recipients. The objective of this project is to detect and categorize such

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages