This repository contains the code and data preprocessing scripts for our Master’s thesis project, which focuses on explicit discourse parsing using fine-tuned LLaMA models. The primary goal is to detect discourse connectives and classify their relations using two different annotation approaches: Numbered Tagging and Bracket Annotation.
Our methodology involves transforming the Penn Discourse TreeBank (PDTB) dataset into structured formats suitable for model training, followed by fine-tuning LLaMA models on these transformed datasets and evaluating their performance.
Preprocessing_PDTB_into_Numbered_Tagging-Bracket_Annotation.ipynb
- This notebook processes raw PDTB text files into Numbered Tagging and Bracket Annotation formats, making them suitable for fine-tuning the LLaMA models.
- Also contains exploratory processing of the PCC German discourse corpus, the FDTB French discourse corpus, and the Czech discourse corpus.
LLamafactory_evaluating_model_prediction.ipynb
- This notebook evaluates the performance of the LLaMA model fine-tuned via LLaMA-Factory using the Bracket Annotation approach.
Unllama_Model_Evaluation.ipynb
- This notebook evaluates the fine-tuned LS-unLLaMA model using the Numbered Tagging approach.
Czech_data_folder/
- Directory containing the raw Czech discourse corpus files.
German_connectives/
- Directory containing the raw German discourse corpus files.
french_corpus_validated/
- Directory containing the raw French discourse corpus files.
best_model_checkpoint/
- Best checkpoint of the fine-tuned LS-unLLaMA model.
This work leverages two repositories for fine-tuning LLaMA models:
- LS-LLaMA: We fine-tuned the LS-unLLaMA model, which uses label supervision to improve performance on token classification tasks. Implementation details are based on the LS-LLaMA repository.
- LLaMA Factory: A second fine-tuning approach was carried out with the LLaMA-Factory repository, which provides flexible instruction tuning for LLaMA models (a sketch of the expected data format follows).
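Below is a minimal sketch of how a bracket-annotated sentence can be packaged as an instruction-tuning record in the alpaca-style JSON format (instruction/input/output fields) that LLaMA-Factory accepts. The prompt wording and file name are illustrative assumptions, not the exact ones used in this thesis.

```python
import json

# Minimal sketch: one alpaca-style instruction-tuning record for LLaMA-Factory.
# The instruction text and output file name below are illustrative only.
record = {
    "instruction": "Mark every explicit discourse connective in the sentence "
                   "with square brackets and append its discourse relation.",
    "input": "I was feeling tired; however, I decided to finish my work.",
    "output": "I was feeling tired; [however] (contrast), I decided to finish my work.",
}

with open("bracket_annotation_train.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```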
- Numbered Tagging: each token in a sentence is mapped to 0, except for explicit discourse connectives, which receive an identifier corresponding to their discourse relation (see the sketch after the example below).
- Example:
  Sentence: "I was feeling tired; however, I decided to finish my work."
  Annotated: 0 0 0 0 3 0 0 0 0 0 0
- Here, the connective "however" is tagged with 3, the identifier for a contrast relation, while every other token is tagged 0.
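To make the mapping concrete, here is a minimal sketch of how a sentence and its connective can be turned into a numbered tag sequence. The whitespace tokenization, punctuation stripping, and the relation-to-identifier mapping are illustrative assumptions, not the exact conventions used in the preprocessing notebook.

```python
# Minimal sketch of the Numbered Tagging format.
# RELATION_IDS and the tokenization below are illustrative assumptions.
RELATION_IDS = {"contrast": 3}  # hypothetical relation-to-id mapping

def numbered_tags(sentence: str, connective: str, relation: str) -> list[int]:
    """Tag every token 0 except the explicit connective, which gets its relation id."""
    tags = []
    for token in sentence.split():
        # Strip trailing punctuation so "however," matches "however".
        if token.strip(".,;:!?").lower() == connective.lower():
            tags.append(RELATION_IDS[relation])
        else:
            tags.append(0)
    return tags

sentence = "I was feeling tired; however, I decided to finish my work."
print(numbered_tags(sentence, "however", "contrast"))
# [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]
```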
- Bracket Annotation: explicit discourse connectives are enclosed in square brackets [] and labeled with their discourse relation (a sketch follows the example below).
- Example:
Sentence: "I was feeling tired; however, I decided to finish my work." Annotated: "I was feeling tired; [however] (contrast), I decided to finish my work."
We compare both annotation approaches by fine-tuning LLaMA models on these datasets and evaluating their performance on discourse connective identification and relation classification. The evaluation notebooks analyze:
- Precision, Recall, and F1-score for both approaches (a minimal computation sketch follows this list).
- Error analysis of misclassified connectives.
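As a reference for how such token-level scores can be computed, here is a minimal sketch using scikit-learn. The actual evaluation code lives in the notebooks above; the gold/predicted sequences below are illustrative.

```python
from sklearn.metrics import precision_recall_fscore_support

# Minimal sketch of token-level scoring for the Numbered Tagging approach.
# Gold and predicted tag sequences are flattened into single lists;
# label 0 ("no connective") is excluded so scores reflect connectives only.
gold = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]
pred = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]

labels = sorted({t for t in gold if t != 0})
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=labels, average="micro", zero_division=0
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```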
If you use this code in your research, please consider citing:
- LS-LLaMA: Label-Supervised Fine-Tuning for LLaMA
- LLaMA-Factory: Fine-Tuning for Large Language Models
For any questions regarding this repository, please feel free to reach out or open an issue.