The following is an open project done under the VLG club of IITR, centred on classifying text by recognising whether or not it was generated by a large language model (LLM).
Given dataset: https://www.kaggle.com/competitions/llm-detect-ai-generated-text/data
Augmented dataset by MIT: https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text
- train.csv - columns for ID, text columns, and a generated column holding the class labels
- contains text columns for classification and a Label column for prediction results
The setup for each model used is given separately below:

In this setup I also downloaded the output files after the first run with Internet "on" and uploaded them into the input directory, which saved me from having to do a first run with Internet on whenever I opened the notebook again.
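The offline setup described above can be sketched as a small helper that prefers the uploaded copy in the input directory and falls back to the hub name only when that copy is missing. This is a minimal sketch, not the notebook's actual code: the function name `resolve_model_source` and the dataset folder name are assumptions made for illustration.

```python
from pathlib import Path


def resolve_model_source(local_dir, hub_name):
    """Return the local directory if it holds saved files, else the hub name."""
    path = Path(local_dir)
    # An uploaded copy counts only if the directory exists and is non-empty.
    if path.is_dir() and any(path.iterdir()):
        return str(path)
    # No local copy found: this branch needs Internet "on" to download.
    return hub_name


# Hypothetical paths: the dataset folder name below is an assumption.
source = resolve_model_source("/kaggle/input/roberta-base-offline", "roberta-base")
print(source)
```

The returned string can then be passed to a loader such as `AutoTokenizer.from_pretrained(source)` from the Transformers library, which accepts either a local directory or a hub model name.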

RoBERTa performed worse despite being a more advanced model, due to overfitting and because the tokenizing had not been optimized to suit it. I implemented those changes later, but some were still left to be accommodated and time didn't allow for them.

BERT took a very long time to train and I was unsuccessful in getting a complete first run on the dataset, so I couldn't save the model in time for submission and had to leave the tokenizer files in the output directory itself.
I did manage to save the tokenized values, but they were later deemed unnecessary due to the further optimisations I made in BERT's tokenisation process.
- Gathering larger datasets
- Tokenizing text for each model
- Optimizing the model for reduced complexity
- Random weight sampling to address overfitting
- Gradient clipping
- Optimizing tokenisation lengths for BERT and RoBERTa by implementing a function to decide the max_len value
- Tracking progress with progress lines inside the code
- Saving the model weights and tokenized values to reduce the time taken by tokenising in subsequent runs
- Saving the model first into the Kaggle working directory, then downloading it and uploading it into the input directory to enable offline model running
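As an illustration of the max_len item in the list above, one possible way to decide the value is to take a high percentile of the token counts and cap it at the model limit. This is a sketch under assumptions, not the function used in the notebook: the name `choose_max_len`, the 95% coverage default, and the nearest-rank percentile rule are all illustrative choices.

```python
import math


def choose_max_len(token_lengths, coverage=0.95, cap=512):
    """Pick the smallest length that covers `coverage` of the texts untruncated.

    token_lengths: token count of each text, from whichever tokenizer is used.
    coverage: fraction of texts that should fit without truncation (assumed).
    cap: upper bound; 512 is the usual maximum for BERT and RoBERTa base models.
    """
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    # Nearest-rank percentile: the rank-th smallest length covers the fraction.
    rank = max(1, math.ceil(coverage * len(ordered)))
    return min(ordered[rank - 1], cap)


# Toy token counts, made up for illustration only.
lengths = [120, 250, 310, 480, 900]
print(choose_max_len(lengths, coverage=0.8))
```

Choosing a percentile rather than the longest text keeps a few very long essays from inflating the padding for every batch, which matters when training time is the bottleneck.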