Alejandro-FA/UPF-Detect-AI-Generated-Text

LLM - Detect AI Generated Text (Visual Analytics - UPF)

To replicate the development environment, run the following commands (you can change the environment name from vis_analytics to something else).

conda env create --name vis_analytics --file environment.yml
conda activate vis_analytics

If your system supports CUDA, we recommend uncommenting both the pytorch-cuda requirement and the nvidia channel in the .yml file.
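If you are unsure whether your system supports CUDA, a quick check along the following lines may help. This is only a sketch: the function name `cuda_probably_available` is hypothetical and not part of this repository, and the `nvidia-smi` fallback is a heuristic that only tells you an NVIDIA driver tool is on your PATH.

```python
import shutil


def cuda_probably_available() -> bool:
    """Best-effort check for a usable CUDA setup (hypothetical helper)."""
    try:
        # torch is only importable once the environment has been created.
        import torch
    except ImportError:
        # Heuristic fallback: NVIDIA drivers normally ship the nvidia-smi tool.
        return shutil.which("nvidia-smi") is not None
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_probably_available())
```

If this prints `False`, the default CPU-only configuration is probably the safer choice.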

Alternatively, we also provide a pip requirements.txt file. Note that the project was developed with Python 3.11; we have not tested the code with other Python versions.

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt

Then you can run the Streamlit app as follows:

streamlit run app/Welcome.py

The first execution of the app downloads our pre-trained model, the AI-generated text detection model from SimpleAI, and some datasets that we use. Depending on your internet connection, this can take several minutes, so please be patient.

These files are too large to be uploaded to GitHub.
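The download-on-first-run behaviour can be pictured with a sketch like the one below. It is an illustration under stated assumptions, not the app's actual code: the helper name `ensure_downloaded`, the `.part` temporary suffix, and any URL or path you pass in are hypothetical.

```python
from pathlib import Path
from typing import Callable
from urllib.request import urlretrieve


def ensure_downloaded(
    url: str,
    target: Path,
    fetch: Callable[[str, Path], object] = urlretrieve,
) -> bool:
    """Download `url` to `target` unless the file already exists.

    Returns True if a download happened, False if the cached file was reused.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    tmp = target.with_name(target.name + ".part")
    fetch(url, tmp)
    tmp.replace(target)
    return True
```

With this pattern only the first run pays the download cost; later runs find the file on disk and skip the network entirely.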

Project description and expected benefits

The goal of the project is to build a classification model capable of accurately detecting whether a text was written by an LLM or by a student. The purpose of the model is to improve plagiarism detection tools in the new learning context defined by AI. This project was motivated by this competition hosted on Kaggle.

Since we don't expect to get great accuracy results (especially in such a short time), we will put a lot of emphasis on explainable AI. The idea is to build a tool capable of detecting potential plagiarism candidates, and then leave the final call to a human (normally a professor). To support this decision, the professor will be given information about why the model predicted plagiarism (e.g. using SHAP).
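To give a feel for the kind of per-word explanation a professor might see, here is a minimal sketch. It deliberately uses a simpler technique than SHAP, namely leave-one-out (occlusion) attribution, and the scoring function `toy_score` with its marker-word list is a hypothetical stand-in for a real detector, not the model from this project.

```python
from typing import Callable, List, Tuple


def occlusion_attributions(
    text: str, score: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Attribute the score to each word by removing it and measuring the change.

    A positive value means the word pushed the score up.
    """
    words = text.split()
    baseline = score(text)
    attributions = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        attributions.append((word, baseline - score(reduced)))
    return attributions


def toy_score(text: str) -> float:
    """Hypothetical detector: fraction of words found in a small marker list."""
    markers = {"furthermore", "moreover", "delve"}
    words = text.lower().split()
    return sum(w in markers for w in words) / len(words) if words else 0.0


if __name__ == "__main__":
    for word, contribution in occlusion_attributions("moreover cats sleep", toy_score):
        print(f"{word:>10s} {contribution:+.3f}")
```

SHAP averages such contributions over many subsets of words rather than removing one word at a time, so its values are generally more reliable when words interact.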

Required data sources

Jules King, Perpetual Baffour, Scott Crossley, Ryan Holbrook, Maggie Demkin. (2023). LLM - Detect AI Generated Text. Kaggle. https://kaggle.com/competitions/llm-detect-ai-generated-text

https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text/data

Expected results/delivery/output

The delivery of the project will include:

  • Python scripts with:

    • Exploratory Data Analysis and feature engineering
    • Training a Machine Learning model to classify texts (between AI-generated and human-generated)
    • Model performance evaluation
  • A Streamlit webapp to present the results and use Explainable AI plots to interpret predictions.
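As an illustration of the model performance evaluation step, the sketch below computes common binary classification metrics from scratch. The function name `binary_metrics` and the label convention (1 = AI-generated, 0 = human-written) are assumptions made for this example; the project's scripts may use different metrics or a library instead.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}.

    Assumed convention: 1 = AI-generated, 0 = human-written.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For a plagiarism screening tool, precision matters in particular: a false positive means a student is wrongly flagged, which is why the final decision is left to a human.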

Visualization method

We will build a Streamlit webapp to present the methodologies used, the analysis of the data, the Machine Learning model and an interpretation of the results (Explainable AI). We will design the webapp with the goal of using storytelling techniques for the presentation.

Useful documentation for the project
