This repository contains a machine learning model for detecting whether a message was written by a human or generated by an AI. The model is built on a custom dataset of ChatGPT (OpenAI) conversations and performs binary classification, distinguishing user messages (human) from assistant responses (AI).

The project builds this binary classifier in several stages, from data cleaning and preprocessing to model training and evaluation.
- Data Splitting (Messages & Roles):
  - Split the original dataset into two distinct data frames: Messages and Roles.
  - Messages: Contains the actual text content.
  - Roles: Contains a label for each message indicating whether it comes from the User (1) or the Assistant (0).
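
A minimal sketch of this step, assuming the raw dataset is a single table with `content` and `role` columns (the file name and column names are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical source file and column names; adjust to the real dataset.
df = pd.read_csv("conversations.csv")

# Messages: the text content only.
messages = df[["content"]].rename(columns={"content": "message"})

# Roles: who wrote each message ("user" or "assistant").
roles = df[["role"]].copy()
```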
- Labeling Roles:
  - Label the role of each message:
    - User = 1
    - Assistant = 0
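
Assuming the role column holds the strings `"user"` and `"assistant"`, the labels can be assigned with a simple mapping (a sketch, not necessarily the project's exact code):

```python
# Map the textual role to the binary target: User = 1, Assistant = 0.
roles["label"] = roles["role"].map({"user": 1, "assistant": 0})
```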
- Text Cleaning & Preprocessing:
  - Text Cleaning: Remove any unnecessary characters or noise from the messages.
  - Text Normalization: Standardize the text format (e.g., converting everything to lowercase).
  - Stopwords Removal: Remove common stopwords that don't contribute much to the meaning of the text.
  - Lemmatization: Convert words to their base form (e.g., "running" becomes "run").
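
A possible implementation of these four steps with NLTK (illustrative only; the actual project may use different libraries, and `WordNetLemmatizer` covers English even though the dataset also contains French):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

# Bilingual stopword list, since the dataset mixes English and French.
stop_words = set(stopwords.words("english")) | set(stopwords.words("french"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()                                       # normalization
    text = re.sub(r"[^a-zàâçéèêëîïôûùüÿœ\s]", " ", text)      # remove noise and punctuation
    tokens = [t for t in text.split() if t not in stop_words]  # stopword removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]          # lemmatization
    return " ".join(tokens)

messages["clean"] = messages["message"].apply(preprocess)
```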
- Word Cloud Generation:
  - After preprocessing the data, a word cloud is generated to visualize the most common words in the dataset.
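
For example, with the `wordcloud` package (parameters are illustrative):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Build one large string from the cleaned messages and render the cloud.
text = " ".join(messages["clean"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```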
- Vectorization (Word Embedding):
  - Use word embeddings to convert the textual data into numerical format suitable for machine learning models.
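
One way to do this is to train a Word2Vec model on the cleaned corpus and represent each message as the average of its word vectors (a sketch assuming gensim; the project may use a different embedding technique or pre-trained vectors):

```python
import numpy as np
from gensim.models import Word2Vec

tokenized = [doc.split() for doc in messages["clean"]]

# Train a small Word2Vec model on the corpus (hyperparameters are illustrative).
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, workers=4)

def embed(tokens):
    # Average the vectors of in-vocabulary words; fall back to a zero vector.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([embed(t) for t in tokenized])
y = roles["label"].values
```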
- Data Splitting (Train/Test):
  - Split the data into training and test sets to evaluate model performance.
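
With scikit-learn this is a one-liner (an 80/20 split is assumed here; the actual ratio may differ):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```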
- Model Training:
  - Logistic Regression: Train a binary classification model using logistic regression.
  - Two training approaches:
    - Without Grid Search Cross-Validation.
    - With Grid Search Cross-Validation for hyperparameter tuning.
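
A sketch of both approaches with scikit-learn (the parameter grid is illustrative, not the one actually used):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Approach 1: plain logistic regression, no hyperparameter search.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Approach 2: grid search cross-validation over the regularization strength.
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_clf = grid.best_estimator_
```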
- Model Evaluation:
  - Test and evaluate the model's performance on the test data, assessing its accuracy and other metrics.
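
For example, accuracy plus a per-class report and confusion matrix (assuming the grid-search model `best_clf` from the previous step):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = best_clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["Assistant (0)", "User (1)"]))
print(confusion_matrix(y_test, y_pred))
```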
- Language Detection: Initially, the goal was to use the Detect API to automatically detect the language of each message. However, the API proved too slow for processing large batches of text, so it was removed from the pipeline. Preprocessing continued without language detection, and the bilingual (English-French) nature of the dataset caused no issues.