This repository contains scripts and tools for text analysis research, focusing on reflection datasets annotated by multiple individuals. The workflow includes dataset preparation, preprocessing, fine-tuning transformer-based models, and clustering using embeddings.
Dependencies: pandas, openpyxl, numpy, torch, nltk, transformers, scikit-learn, seaborn, matplotlib, datasets
- Dataset Preparation Scripts
- prepare_dataset.py : Prepares the dataset from raw reflection data annotated by multiple people.
- Utility Scripts
- data_preprocessing.py : Contains functions for text preprocessing and data preparation.
- functions.py : Utility functions used across the fine-tuning and clustering workflows.
- Main Scripts
- finetune_model_bert.py : Trains a transformer-based model (BERT) on the reflection dataset to generate embeddings for downstream tasks.
- clustering.py : Performs clustering using the embeddings generated by the transformer model.
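
The dataset preparation step listed above works with reflections labeled by several annotators. This README does not say how disagreements between annotators are resolved, so the following is only a minimal sketch of one common option, majority voting. The function name `majority_label`, the row layout, and the example labels are all hypothetical and are not taken from prepare_dataset.py.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None if empty or tied for first place."""
    counts = Counter(labels).most_common()
    if not counts:
        return None
    # A tie for the top count means the annotators did not reach a majority.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical rows: one reflection text with one label per annotator.
rows = [
    {"text": "I learned to plan my time better.",
     "labels": ["planning", "planning", "emotion"]},
    {"text": "The group work was frustrating.",
     "labels": ["emotion", "teamwork"]},
]

# Pair each text with its resolved label (None marks an unresolved tie).
resolved = [(row["text"], majority_label(row["labels"])) for row in rows]

for text, label in resolved:
    print(label, "|", text)
```

Returning `None` on a tie keeps ambiguous reflections visible so they can be reviewed or dropped explicitly rather than silently assigned to whichever label happened to come first.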