
LLM Text Classification

This is an open project done under the VLG club of IIT Roorkee (IITR), centred around classifying text by recognising whether or not it has been generated by a Large Language Model.


Dataset links:

  • Given dataset: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data
  • Augmented dataset by MIT: https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text

About the data

Given dataset:

  • train.csv - columns for id, the essay text, and a generated column holding the classification labels

Augmented dataset:

  • contains a text column for classification and a label column with the prediction targets (see the loading sketch below)
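
As a rough illustration, the snippet below reads both CSVs with pandas. The paths and the augmented file name (final_train.csv) are assumptions based on the usual Kaggle input layout, not taken from the notebooks.

```python
# A minimal loading sketch; paths and the augmented file name are assumptions
# based on the default Kaggle input layout, not taken from the notebooks.
import pandas as pd

# Competition data: id, text, and a binary "generated" label.
given = pd.read_csv("/kaggle/input/llm-detect-ai-generated-text/train.csv")
print(given.columns.tolist())

# Augmented data: a text column plus a label column (file name hypothetical).
augmented = pd.read_csv(
    "/kaggle/input/augmented-data-for-llm-detect-ai-generated-text/final_train.csv"
)
print(augmented.head())
```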

Setup:

  • The setup for each model used is given separately below:

DistilBERT (public score 0.803):

[Screenshot: DistilBERT notebook setup]

In this setup I also downloaded the output files after the first internet-on run and uploaded them into the input directory, to save the effort of needing a first run with internet on whenever I opened the notebook again (see the sketch below).
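
A minimal sketch of that save-then-reload workflow, assuming the standard Hugging Face transformers API; the directory names are placeholders:

```python
# Sketch of the offline workflow described above: save the model during an
# internet-on run, then load it from a re-uploaded Kaggle input dataset.
# Directory names are hypothetical placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# First run (internet on): download from the Hub and save to /kaggle/working.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model.save_pretrained("/kaggle/working/distilbert-saved")
tokenizer.save_pretrained("/kaggle/working/distilbert-saved")

# Later runs (internet off): load from the manually re-uploaded input dataset.
model = AutoModelForSequenceClassification.from_pretrained(
    "/kaggle/input/distilbert-saved"
)
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/distilbert-saved")
```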

RoBERTa (public score 0.672):

[Screenshot: RoBERTa notebook setup]

RoBERTa performed worse despite being a more advanced model, due to overfitting and a lack of optimisation in the tokenizing step. I implemented the suitable tokenizing changes later, but some were still left to be accommodated and time didn't allow for them.

BERT (public score N/A):

[Screenshot: BERT notebook setup]

BERT took a very long time to train and I was unsuccessful in getting a complete first run on the dataset, so I couldn't save the model in time for submission and had to leave the tokenizer files in the output directory itself.

I did manage to save the tokenized values, but they were later deemed unnecessary due to the further optimisations I made to BERT's tokenisation process.
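
For illustration, caching tokenized values so later runs can skip re-tokenizing might look like the sketch below; the cache path, model name, and max_len are placeholders, not the repo's actual values:

```python
# Sketch of caching tokenized values between runs; paths and parameters are
# hypothetical placeholders.
import os
import torch
from transformers import AutoTokenizer

CACHE = "/kaggle/working/bert_encodings.pt"

def get_encodings(texts, max_len=256):
    if os.path.exists(CACHE):
        return torch.load(CACHE)          # reuse previously tokenized values
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = tok(texts, truncation=True, padding="max_length",
              max_length=max_len, return_tensors="pt")
    torch.save(dict(enc), CACHE)          # save tensors for the next run
    return enc
```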

Steps Taken:

  • Gathering larger datasets
  • Tokenizing the text for each model
  • Optimising the models for reduced complexity
  • Weighted random sampling to address overfitting (see the training sketch after this list)
  • Gradient clipping (also shown in the training sketch below)
  • Optimising tokenisation lengths for BERT and RoBERTa by implementing a function to decide the max_len value (see the max_len sketch after this list)
  • Tracking progress via progress lines inside the code
  • Saving the model weights and tokenized values to reduce the time spent tokenizing on subsequent runs
  • Saving the model first into the Kaggle working directory, then downloading it and uploading it into the input directory to enable offline model running
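
A self-contained training sketch covering the sampling and clipping steps above, assuming "random weight sampling" refers to PyTorch's WeightedRandomSampler; the toy data, model, and hyperparameters are placeholders, not the repo's code:

```python
# Sketch of weighted random sampling plus gradient clipping. Toy data and a
# linear stand-in model are used so the snippet runs on its own.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced data: 90 human (0) vs 10 LLM (1) samples (assumed labels).
features = torch.randn(100, 8)
labels = torch.cat([torch.zeros(90, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])

# Weight each sample inversely to its class frequency so batches are balanced.
class_counts = torch.bincount(labels)
sample_weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                replacement=True)
loader = DataLoader(TensorDataset(features, labels),
                    batch_size=16, sampler=sampler)

model = nn.Linear(8, 2)                    # stand-in for the transformer head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Gradient clipping bounds the update magnitude to stabilise training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```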
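And a sketch of a function to decide the max_len value; picking a high percentile of the corpus token-length distribution, capped at the 512-token model limit, is an assumption about the approach rather than the repo's exact logic:

```python
# Sketch of deciding max_len from the token-length distribution of the corpus;
# the percentile heuristic and defaults are assumptions.
import numpy as np
from transformers import AutoTokenizer

def decide_max_len(texts, model_name="bert-base-uncased",
                   percentile=95, cap=512):
    """Pick a max_length covering `percentile`% of texts, capped at the
    model's 512-token limit."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    lengths = [len(tokenizer.encode(t, truncation=False)) for t in texts]
    return int(min(np.percentile(lengths, percentile), cap))

# Usage (train_df is hypothetical):
# max_len = decide_max_len(train_df["text"].tolist())
```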
