To replicate the development environment, run the following commands (you can replace vis_analytics with any other environment name):
conda env create --name vis_analytics --file environment.yml
conda activate vis_analytics
If your system supports CUDA, we recommend uncommenting the pytorch-cuda requirement and the nvidia channel in the .yml file.
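If you are unsure whether the CUDA build is being picked up, the following minimal sketch can be run inside the activated environment. It assumes PyTorch is the package that pytorch-cuda accompanies; the helper name `cuda_available` is hypothetical and not part of this project.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    # Avoid an ImportError when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

If this prints False on a machine with an NVIDIA GPU, the CPU-only build of PyTorch was probably installed.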
Alternatively, we also provide a pip requirements.txt file. Note that the project was developed with Python 3.11; we have not tested whether the code works with other Python versions.
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
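Because only Python 3.11 has been tested, it can help to confirm which interpreter the virtual environment uses. This is a small sketch; the names `DEVELOPED_WITH` and `matches_developed_version` are hypothetical and do not exist in the project.

```python
import sys

# Version stated above as the one the project was developed with.
DEVELOPED_WITH = (3, 11)


def matches_developed_version(version_info=sys.version_info) -> bool:
    """Compare the major and minor version against the developed version."""
    return tuple(version_info[:2]) == DEVELOPED_WITH


if __name__ == "__main__":
    if not matches_developed_version():
        found = ".".join(str(part) for part in sys.version_info[:2])
        print(f"Warning: running Python {found}; only 3.11 has been tested.")
```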
Then you can run the Streamlit app as follows:
streamlit run app/Welcome.py
The first execution of the app will download our pre-trained model, the AI-generated text detection model from SimpleAI, and some datasets that we use. Depending on your internet connection, this process could take several minutes, so please be patient.
These files are too large to be uploaded to GitHub.
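The download-on-first-run behaviour can be pictured with the sketch below. It is not the app's actual code: the function `ensure_file` and the `fetch` callback are hypothetical, and the real app may store and retrieve its files differently.

```python
from pathlib import Path
from typing import Callable


def ensure_file(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Fetch and store `path` only if it is missing.

    Returns True if the file was fetched, False if it already existed.
    """
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # does not leave a truncated file that looks complete.
    tmp = path.with_name(path.name + ".part")
    tmp.write_bytes(fetch())
    tmp.replace(path)
    return True
```

With this pattern only the first run pays the download cost; later runs find the files on disk and start immediately.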
The goal of the project is to build a classification model capable of accurately detecting whether a text was written by an LLM or by a student. The purpose of the model is to improve plagiarism detection tools in this new learning context defined by AI. The project was motivated by this competition hosted on Kaggle.
Since we don't expect to get great accuracy results (especially in such a short time), we will put a lot of emphasis on explainable AI. The idea is to build a tool capable of detecting potential plagiarism candidates, and then leave the final call to a human (normally a professor). To support this decision, the professor will be given information about why the model predicted plagiarism (e.g. using SHAP).
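To give a feel for the kind of per-feature explanation SHAP produces, here is a minimal sketch for the simplest case. For a linear score with independent features, the SHAP value of feature i is w_i * (x_i - mean_i). The feature names, weights, and data below are invented for illustration; this is neither the project's model nor the shap library.

```python
from statistics import fmean


def linear_shap_values(weights, background, x):
    """SHAP values of a linear score f(x) = b + sum(w_i * x_i).

    Assuming independent features, the contribution of feature i is
    w_i * (x_i - mean_i), where mean_i is computed over `background`.
    """
    means = [fmean(column) for column in zip(*background)]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


# Hypothetical features: [average sentence length, typo rate]; values are made up.
weights = [0.5, -2.0]
background = [[10.0, 1.0], [20.0, 3.0]]  # feature means: [15.0, 2.0]
x = [19.0, 0.5]

phi = linear_shap_values(weights, background, x)
print(phi)  # [2.0, 3.0]
```

The two values add up to the difference between the score for this text and the score at the background mean, which is what lets a reader see how much each feature pushed the prediction.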
Jules King, Perpetual Baffour, Scott Crossley, Ryan Holbrook, Maggie Demkin. (2023). LLM - Detect AI Generated Text. Kaggle. https://kaggle.com/competitions/llm-detect-ai-generated-text
https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text/data
The delivery of the project will include:
- Python scripts with:
  - Exploratory Data Analysis and feature engineering
  - Training a Machine Learning model to classify texts (AI-generated versus human-generated)
  - Model performance evaluation
- A Streamlit webapp to present the results and use Explainable AI plots to interpret predictions.
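As an illustration of the model performance evaluation step, the sketch below computes common binary classification metrics in plain Python. The choice of metrics and the label convention (1 = AI-generated, 0 = human-written) are assumptions; the deliverables above do not specify which metrics the project reports.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = AI-generated, 0 = human-written.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In a plagiarism setting, precision matters in particular: a false positive means a student is wrongly flagged, which is one reason the final call is left to a human.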
We will build a Streamlit webapp to present the methodologies used, the analysis of the data, the Machine Learning model and an interpretation of the results (Explainable AI). We will design the webapp with the goal of using storytelling techniques for the presentation.