Course: STT 811 – Applied Statistics Modeling for Data Scientists
Instructor: Savvy Barnes
Contributors: Andrew John J, Roshni Bhowmik, Mahnoor Sheikh, Ab Basit Syed Rafi
🌐 Streamlit App: STT811 Text Classification App
📁 GitHub Repo: [STT811_StatsProject](https://github.com/andrew-jxhn/STT811_StatsProject)
With the increasing use of AI tools like ChatGPT in academia, distinguishing between human- and AI-generated responses is essential for maintaining academic integrity. This project explores a machine learning pipeline to classify text as human- or AI-generated based on linguistic and semantic features.
- Source: Custom dataset of 2,239 rows (from Mendeley)
- Contents:
  - `Question`: the original statistics question
  - `Human Response`: the text response from a student
  - `AI Response`: text generated using a language model
- Post-cleaning: 1,993 usable examples
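
For orientation, here is a minimal sketch of loading and filtering the dataset with pandas; the file name `dataset.csv` is an assumption, not the repo's actual path:

```python
import pandas as pd

# Load the question/response dataset (file name is an assumption)
df = pd.read_csv("dataset.csv")  # columns: Question, Human Response, AI Response

print(df.shape)                     # ~2,239 rows before cleaning
df = df.dropna().drop_duplicates()  # drop empty and duplicate rows
print(df.shape)                     # ~1,993 usable examples remain
```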
- Cleaning: Lowercasing, punctuation removal, tokenization, stopword removal
- Feature Creation:
- Text length, special character counts
- Flesch Reading Ease, Gunning Fog Index
- Cosine similarity to question
- Sentiment scores and sentiment gaps
- Vectorization: `CountVectorizer` followed by PCA (95% of variance retained in 482 components); a sketch of the full pipeline follows this list
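
A minimal sketch of the preprocessing and feature pipeline, assuming the data has been reshaped so each row holds one `question`/`response` pair; the column names and helper structure are assumptions:

```python
import re
import nltk
import textstat
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # punctuation removal
    tokens = text.split()                          # whitespace tokenization
    return " ".join(t for t in tokens if t not in STOP)

responses = [clean(r) for r in df["response"]]  # assumed long-format columns
questions = [clean(q) for q in df["question"]]

# Bag-of-words vectorization, then PCA keeping 95% of the variance
vec = CountVectorizer()
X = vec.fit_transform(responses).toarray()
X_reduced = PCA(n_components=0.95).fit_transform(X)

# Hand-crafted features: length, readability, similarity to the question
lengths = [len(r.split()) for r in responses]
flesch = [textstat.flesch_reading_ease(r) for r in df["response"]]
fog = [textstat.gunning_fog(r) for r in df["response"]]
q_vecs = vec.transform(questions)
r_vecs = vec.transform(responses)
cos_sim = [cosine_similarity(q_vecs[i], r_vecs[i])[0, 0]
           for i in range(len(responses))]
```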
Key visuals and insights:
- Top Trigrams and Common Words in AI vs. Human responses
- Word Clouds and Text Length Distribution
- Sentiment Gap Analysis and KDE Estimation
- Readability Scores: AI responses are longer and more formulaic
- Text Similarity: AI more aligned with original questions
- Pairplots & Correlation Heatmaps reveal subtle response patterns
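
For instance, the top trigrams per class can be extracted with scikit-learn's n-gram support; `ai_texts` and `human_texts` below are assumed lists of cleaned responses:

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_trigrams(texts, k=10):
    """Return the k most frequent trigrams across a list of documents."""
    vec = CountVectorizer(ngram_range=(3, 3))
    counts = vec.fit_transform(texts).sum(axis=0).A1  # total count per trigram
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, counts), key=lambda p: p[1], reverse=True)[:k]

print(top_trigrams(ai_texts))     # formulaic AI phrasing tends to surface here
print(top_trigrams(human_texts))
```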
- Models evaluated: Logistic Regression, Linear SVM, Decision Tree, Random Forest, KNN, Gradient Boosting, MLP
- Best Accuracy: ~85% (Logistic Regression, SVM, MLP)
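
A minimal sketch of this comparison for three of the models, assuming `X_reduced` and binary labels `y` from the feature pipeline above:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "rf": RandomForestClassifier(n_estimators=200),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.3f}")  # the best models land around 0.85
```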
- Model: `bert-base-uncased` via Hugging Face
- Training:
  - Tokenization (WordPiece)
  - 30 epochs with cross-entropy loss
  - AdamW optimizer
- Performance: comparable to traditional models, with potential for further gains
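
A minimal fine-tuning sketch with the Hugging Face `transformers` library; it trains full-batch for brevity, and every hyperparameter not listed above (batch handling, learning rate, max length) is an assumption:

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # binary: human vs. AI
).to(device)

# texts: list of response strings, y: list of 0/1 labels (assumed prepared earlier)
enc = tokenizer(texts, padding=True, truncation=True,
                max_length=256, return_tensors="pt")
labels = torch.tensor(y)

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(30):  # 30 epochs, as listed above
    optimizer.zero_grad()
    out = model(
        input_ids=enc["input_ids"].to(device),
        attention_mask=enc["attention_mask"].to(device),
        labels=labels.to(device),  # cross-entropy loss computed internally
    )
    out.loss.backward()
    optimizer.step()
```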
The Streamlit app lets you:
- Upload new questions and responses
- Evaluate text using trained models
- Visual analytics: word clouds, trigrams, readability, sentiment
- Compare AI vs. human characteristics interactively
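
A minimal sketch of the app's scoring path; the saved-artifact file names (`vectorizer.joblib`, `classifier.joblib`) are assumptions rather than the repo's actual artifacts:

```python
import joblib
import streamlit as st

st.title("STT811 Text Classification")

question = st.text_area("Question")
response = st.text_area("Response")

if st.button("Classify"):
    # Assumed artifacts: a fitted vectorizer and classifier saved with joblib.
    # Simplified: the real app also computes the engineered features above.
    vec = joblib.load("vectorizer.joblib")
    clf = joblib.load("classifier.joblib")
    pred = clf.predict(vec.transform([response]))[0]
    st.write("Prediction:", "AI-generated" if pred == 1 else "Human-written")
```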
- Human responses were simpler, less verbose, and showed more variability
- AI responses were longer, sentimentally aligned with questions, and structurally consistent
- Readability, sentiment gap, and cosine similarity are strong distinguishing features
- The system offers a foundational step toward detecting AI-generated content in education
```bash
# Clone the repo
git clone https://github.com/andrew-jxhn/STT811_StatsProject.git
cd STT811_StatsProject

# Create a virtual environment (optional)
python -m venv venv
source venv/bin/activate  # or .\venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Run the Streamlit app
streamlit run streamlit_code.py
```