This project leverages a BERT-based deep learning model to classify text articles as either AI-generated or human-written. Using PyTorch and the Hugging Face transformers
library, the project implements fine-tuning of a pre-trained BERT model for binary classification.
The primary goal of this project is to classify text data into two categories:
- AI-generated
- Human-written
The workflow includes:
- Tokenizing text data using a BERT tokenizer.
- Defining a PyTorch dataset and data loader for text and labels.
- Building and training a custom BERT-based classifier.
- Evaluating the model using stratified cross-validation.
- Saving the trained model for deployment or further analysis.
- Pre-trained BERT Model: Fine-tunes `bert-base-uncased` for text classification.
- Custom Dataset Class: Implements a PyTorch-compatible dataset class for efficient data handling.
- Cross-Validation: Uses Stratified K-Fold cross-validation to ensure robust evaluation.
- Evaluation Metrics: Calculates accuracy, F1 score, precision, and recall.
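Stratified K-fold evaluation with these metrics could be wired up as in the sketch below. The fold count and `train_and_predict` are placeholders (assumptions); in the real project that function would fine-tune BERT on the training fold and predict on the validation fold:

```python
# Sketch of stratified K-fold evaluation with the four metrics above.
# train_and_predict is a stand-in for the actual fine-tuning loop.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def train_and_predict(train_idx, val_idx, texts, labels):
    # Placeholder: a real run would train on texts[train_idx] and
    # return model predictions for texts[val_idx].
    return labels[val_idx]

texts = np.array(["example text"] * 10)
labels = np.array([0, 1] * 5)  # balanced dummy labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in skf.split(texts, labels):
    preds = train_and_predict(train_idx, val_idx, texts, labels)
    y_true = labels[val_idx]
    scores.append({
        "accuracy": accuracy_score(y_true, preds),
        "f1": f1_score(y_true, preds),
        "precision": precision_score(y_true, preds),
        "recall": recall_score(y_true, preds),
    })

mean_acc = np.mean([s["accuracy"] for s in scores])
```

Stratification keeps the AI/human class ratio the same in every fold, so per-fold metrics are comparable even on small datasets.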
- Python 3.7+
- Libraries:
  - `torch`
  - `transformers`
  - `pandas`
  - `numpy`
  - `scikit-learn`
- Experiment with advanced pre-trained models like RoBERTa or DeBERTa.
- Handle class imbalance with techniques like oversampling or weighted loss functions.
- Extend the model to support multiclass classification for other types of text.
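Of the ideas above, a weighted loss is the smallest change to try first. A minimal sketch, assuming inverse-frequency class weights (the weighting scheme is an illustrative choice, not specified by the project):

```python
# Sketch of weighted cross-entropy for class imbalance.
# Inverse-frequency weighting is an illustrative assumption.
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 1])      # imbalanced: 4 human, 1 AI
counts = torch.bincount(labels, minlength=2).float()
weights = counts.sum() / (2 * counts)       # rarer class gets a larger weight

loss_fn = nn.CrossEntropyLoss(weight=weights)
logits = torch.zeros(len(labels), 2)        # dummy classifier outputs
loss = loss_fn(logits, labels)
```

With these weights, misclassifying the rare class costs four times as much as misclassifying the common one, which counteracts the model's tendency to predict the majority class.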