LLM Text Classification

This is an open project done under the VLG club of IITR. It centres on classifying text by recognising whether or not it was generated by a large language model.

Dataset links:

Given dataset: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data

Augmented dataset by MIT: https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text

About the data

Given dataset

- train.csv: columns for the ID and the text, plus a generated column that holds the class labels

Augmented dataset

Contains text columns for classification and a Label column that holds the class labels
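Because the two datasets name their label column differently, they need to be brought into one shape before training. The following is a minimal sketch of that step using only the standard library. The header names `generated` and `label`, the helper `load_rows`, and the sample rows are assumptions made for illustration, so adjust them to the actual CSV headers.

```python
import csv
import io


def load_rows(csv_file, text_col, label_col):
    """Read a CSV file object and return a list of (text, label) pairs."""
    reader = csv.DictReader(csv_file)
    return [(row[text_col], int(row[label_col])) for row in reader]


# Tiny in-memory stand-ins for the two CSV files (made-up rows, not real data).
given_csv = io.StringIO("id,text,generated\na1,An essay written by a student.,0\n")
augmented_csv = io.StringIO("text,label\nAn essay produced by a model.,1\n")

# The column names below are assumptions; check them against the real files.
combined = (
    load_rows(given_csv, "text", "generated")
    + load_rows(augmented_csv, "text", "label")
)
print(combined)
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="", encoding="utf-8")`.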

Setup:

The setup for each model used is given separately below:

DistilBERT (public score 0.803):

In this setup I also downloaded the output files after the first run with Internet access turned on and uploaded them into the input directory, so that I would not need an Internet-enabled first run every time I reopened the notebook.
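One way to make a notebook pick up the uploaded files automatically is to check for a saved model directory first and fall back to the hub name only when it is missing. The sketch below is an illustration, not the notebook's actual code: the helper `resolve_model_source`, the directory `/kaggle/input/distilbert-offline`, and the checkpoint name `distilbert-base-uncased` are all assumptions.

```python
import os
import tempfile


def resolve_model_source(local_dir, hub_name):
    """Return local_dir if it holds a saved model config, else the hub name."""
    # save_pretrained() writes a config.json next to the model weights,
    # so its presence is used here as the marker of a usable local copy.
    if os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return hub_name


# Hypothetical paths: Kaggle mounts uploaded datasets under /kaggle/input/.
source = resolve_model_source(
    "/kaggle/input/distilbert-offline", "distilbert-base-uncased"
)
print(source)
```

The returned string can then be passed to the `from_pretrained` method of the Hugging Face model and tokenizer classes, which accepts either a hub identifier or a local directory.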

RoBERTa (public score 0.672):

RoBERTa performed worse despite being a more advanced model, because of overfitting and because the tokenization had not been optimized for it. I implemented the tokenization changes later, but some were still left to be accommodated and time didn't allow for them.
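Weighted random sampling is listed under "Steps Taken" as the measure used against overfitting. A minimal sketch of the idea follows, assuming inverse-class-frequency weights (the weighting actually used in the notebooks is not stated). The helper names are hypothetical. In PyTorch, per-sample weights like these would typically be handed to `torch.utils.data.WeightedRandomSampler`.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each sample a weight of 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def sample_indices(labels, num_samples, seed=0):
    """Draw sample indices with replacement, in proportion to their weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Imbalanced toy labels: eight human-written (0) and two generated (1).
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)
drawn = sample_indices(labels, num_samples=1000)

# Share of generated samples among the draws; close to one half by design.
share = sum(labels[i] for i in drawn) / len(drawn)
print(round(share, 2))
```

Each class carries the same total weight, so both classes are drawn about equally often even though the raw data is skewed.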

BERT (public score N/A):

BERT took a very long time to train and I was unsuccessful in getting a complete first run on the dataset, so I couldn't save the model in time for submission and had to leave the tokenizer files in the output directory itself.

I did manage to save the tokenized values, but they were later made unnecessary by the further optimisations I made to the tokenisation process of BERT.
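One of those tokenisation optimisations was a function that decides the max_len value. The rule the notebooks use is not given here, so the sketch below assumes a common one: take a high percentile of the tokenized text lengths and cap it at the 512-token input limit of BERT and RoBERTa. The function name `choose_max_len`, the 95th percentile, and the sample lengths are illustrative assumptions.

```python
import math


def choose_max_len(token_lengths, percentile=95, model_limit=512):
    """Pick a max_len that covers `percentile`% of texts, capped at the limit."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    # Nearest-rank percentile: the smallest length that covers the requested
    # share of the texts.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return min(ordered[rank - 1], model_limit)


# Made-up token counts for ten texts.
lengths = [120, 180, 200, 230, 260, 310, 340, 400, 450, 900]
print(choose_max_len(lengths))                  # capped by model_limit
print(choose_max_len(lengths, percentile=50))   # median-sized budget
```

A shorter max_len reduces padding and therefore training time, at the cost of truncating the longest texts.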

Steps Taken:

Gathering larger datasets
Tokenizing text for each model
Optimizing the model for reduced complexity
Random weighted sampling to address overfitting
Gradient clipping
Optimizing tokenization lengths for BERT and RoBERTa by implementing a function to decide the max_len value
Tracking progress with progress lines inside the code
Saving the model weights and tokenized values to reduce the time spent on tokenizing in subsequent runs
Saving the model first into the Kaggle working directory, then downloading it and uploading it into the input directory to enable running the model offline
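Of the steps above, gradient clipping is the easiest to show in isolation. The sketch below clips a list of gradient values by their global L2 norm in plain Python. It illustrates the technique only: the threshold used in the notebooks is not stated, and the helper name is hypothetical. In a PyTorch training loop the same effect usually comes from calling `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` between `loss.backward()` and `optimizer.step()`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# Toy gradient whose norm is larger than the threshold.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Clipping keeps the direction of the update and only shrinks its size, which limits the damage a single unusually large gradient can do.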

Repository: Swadesh06/LLM-AI-Genrated-Text-Classification
