This repository contains a machine learning model for detecting whether a message was written by a human or generated by an AI. The model is built on a custom dataset of ChatGPT (OpenAI) conversations and performs binary classification, distinguishing user messages (human) from assistant responses (AI).

The project builds this binary classifier in several stages, from data cleaning and preprocessing to model training and evaluation.
- Data Splitting (Messages & Roles):
  - Split the original dataset into two distinct data frames: Messages and Roles.
  - Messages: Contains the actual text content.
  - Roles: Contains a label for each message indicating whether it comes from the User (1) or the Assistant (0).
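
A minimal sketch of this step, assuming the raw dataset is a single table with `content` and `role` columns (the file name and column names are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical source file and column names; adjust to the real dataset.
df = pd.read_csv("conversations.csv")

# Messages: the text content only.
messages = df[["content"]].rename(columns={"content": "message"})

# Roles: who wrote each message ("user" or "assistant").
roles = df[["role"]].copy()
```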
- Labeling Roles:
  - Label the role of each message:
    - User = 1
    - Assistant = 0
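
Assuming the role column holds the strings `"user"` and `"assistant"`, the labels can be assigned with a simple mapping (a sketch, not necessarily the project's exact code):

```python
# Map the textual role to the binary target: User = 1, Assistant = 0.
roles["label"] = roles["role"].map({"user": 1, "assistant": 0})
```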
- Text Cleaning & Preprocessing:
  - Text Cleaning: Remove any unnecessary characters or noise from the messages.
  - Text Normalization: Standardize the text format (e.g., converting everything to lowercase).
  - Stopwords Removal: Remove common stopwords that don't contribute much to the meaning of the text.
  - Lemmatization: Convert words to their base form (e.g., "running" becomes "run").
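
A possible implementation of these four steps with NLTK (illustrative only; the actual project may use different libraries, and `WordNetLemmatizer` covers English even though the dataset also contains French):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

# Bilingual stopword list, since the dataset mixes English and French.
stop_words = set(stopwords.words("english")) | set(stopwords.words("french"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = text.lower()                                       # normalization
    text = re.sub(r"[^a-zàâçéèêëîïôûùüÿœ\s]", " ", text)      # remove noise and punctuation
    tokens = [t for t in text.split() if t not in stop_words]  # stopword removal
    tokens = [lemmatizer.lemmatize(t) for t in tokens]          # lemmatization
    return " ".join(tokens)

messages["clean"] = messages["message"].apply(preprocess)
```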
- Word Cloud Generation:
  - After preprocessing the data, a word cloud is generated to visualize the most common words in the dataset.
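
For example, with the `wordcloud` package (parameters are illustrative):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Build one large string from the cleaned messages and render the cloud.
text = " ".join(messages["clean"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```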
- Vectorization (Word Embedding):
  - Use word embeddings to convert the textual data into numerical format suitable for machine learning models.
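
One way to do this is to train a Word2Vec model on the cleaned corpus and represent each message as the average of its word vectors (a sketch assuming gensim; the project may use a different embedding technique or pre-trained vectors):

```python
import numpy as np
from gensim.models import Word2Vec

tokenized = [doc.split() for doc in messages["clean"]]

# Train a small Word2Vec model on the corpus (hyperparameters are illustrative).
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=2, workers=4)

def embed(tokens):
    # Average the vectors of in-vocabulary words; fall back to a zero vector.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([embed(t) for t in tokenized])
y = roles["label"].values
```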
- Data Splitting (Train/Test):
  - Split the data into training and test sets to evaluate model performance.
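
With scikit-learn this is a one-liner (an 80/20 split is assumed here; the actual ratio may differ):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```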
- Model Training:
  - Logistic Regression: Train a binary classification model using logistic regression.
  - Two training approaches:
    - Without Grid Search Cross-Validation.
    - With Grid Search Cross-Validation for hyperparameter tuning.
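
A sketch of both approaches with scikit-learn (the parameter grid is illustrative, not the one actually used):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Approach 1: plain logistic regression, no hyperparameter search.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Approach 2: grid search cross-validation over the regularization strength.
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
best_clf = grid.best_estimator_
```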
- Model Evaluation:
  - Test and evaluate the model's performance on the test data, assessing its accuracy and other metrics.
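
For example, accuracy plus a per-class report and confusion matrix (assuming the grid-search model `best_clf` from the previous step):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = best_clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["Assistant (0)", "User (1)"]))
print(confusion_matrix(y_test, y_pred))
```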
- Language Detection: Initially, the goal was to use the Detect API to automatically detect the language of each message. However, the API proved too slow for processing large batches of text, so it was removed from the pipeline. Preprocessing continued without language detection, and the bilingual (English-French) nature of the dataset caused no issues.