Explicit-Discourse-Parsing-Using-LLMS

Overview

This repository contains the code and data preprocessing scripts for our Master’s thesis project, which focuses on explicit discourse parsing using fine-tuned LLaMA models. The primary goal is to detect discourse connectives and classify their relations using two different annotation approaches: Numbered Tagging and Bracket Annotation.

Our methodology transforms the Penn Discourse Treebank (PDTB) into structured formats suitable for model training, fine-tunes LLaMA models on the transformed datasets, and evaluates their performance.

Repository Structure

  • Preprocessing_PDTB_into_Numbered_Tagging-Bracket_Annotation.ipynb
    • Processes raw PDTB text files into the Numbered Tagging and Bracket Annotation formats used to fine-tune the LLaMA models.
    • Also explores the PCC German discourse corpus, the FDTB French discourse corpus, and the Czech discourse corpus.
  • LLamafactory_evaluating_model_prediction.ipynb
    • Evaluates the fine-tuned LLaMA model trained with the Bracket Annotation approach.
  • Unllama_Model_Evaluation.ipynb
    • Evaluates the fine-tuned LS-unLLaMA model trained with the Numbered Tagging approach.
  • Czech_data_folder/
    • Raw files of the Czech discourse corpus.
  • German_connectives/
    • Raw files of the German discourse corpus.
  • french_corpus_validated/
    • Raw files of the French discourse corpus.
  • best_model_checkpoint/
    • Best checkpoint of the fine-tuned LS-unLLaMA model.

Fine-Tuning Methodologies

This work leverages two repositories for fine-tuning LLaMA models:

  • LS-LLaMA: We fine-tuned the LS-unLLaMA model, which adds label supervision to the decoder-only LLaMA architecture and thereby performs better on token classification tasks. Implementation details follow the LS-LLaMA repository.
  • LLaMA Factory: A second fine-tuning approach used the LLaMA-Factory repository, which provides flexible instruction tuning for LLaMA models; an illustrative training-record sketch follows this list.
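
The sketch below shows how a single Bracket Annotation example could be packaged as an Alpaca-style JSON record for instruction tuning with LLaMA-Factory. It is a minimal illustration, not the thesis code: the instruction wording, the output file name, and the record layout are assumptions.

```python
# Hypothetical sketch: package one Bracket Annotation example as an
# Alpaca-style instruction record. Prompt wording and file name are
# illustrative, not taken from the thesis.
import json

def make_record(sentence: str, annotated: str) -> dict:
    return {
        "instruction": (
            "Mark every explicit discourse connective in the sentence with "
            "square brackets and give its discourse relation in parentheses."
        ),
        "input": sentence,
        "output": annotated,
    }

records = [
    make_record(
        "I was feeling tired; however, I decided to finish my work.",
        "I was feeling tired; [however] (contrast), I decided to finish my work.",
    )
]

# LLaMA-Factory consumes instruction datasets as JSON files registered in
# its dataset configuration.
with open("pdtb_bracket_sft.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```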

Annotation Approaches

1. Numbered Tagging

  • Each token in a sentence is mapped to 0, except for explicit discourse connectives, which receive an identifier corresponding to their discourse relation.
  • Example (one label per whitespace-separated token):
    Sentence: "I was feeling tired; however, I decided to finish my work."
    Annotated: 0 0 0 0 3 0 0 0 0 0 0

  • Here, the connective "however" is tagged with 3, representing a contrast relation.
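
A minimal sketch of this conversion, assuming whitespace tokenization and a hypothetical relation-to-identifier mapping (the real pipeline may tokenize and number relations differently):

```python
# Illustrative sketch, not the thesis code: label every token 0 except the
# connective, which gets its relation's identifier.
RELATION_IDS = {"contrast": 3}  # hypothetical mapping; 0 = not a connective

def numbered_tags(sentence: str, connective_idx: int, relation: str) -> list[int]:
    tokens = sentence.split()  # whitespace tokenization (assumption)
    labels = [0] * len(tokens)
    labels[connective_idx] = RELATION_IDS[relation]
    return labels

sent = "I was feeling tired; however, I decided to finish my work."
print(numbered_tags(sent, 4, "contrast"))
# [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]
```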

2. Bracket Annotation

  • Explicit discourse connectives are enclosed in square brackets [] and followed by their discourse relation in parentheses.
  • Example:
    Sentence: "I was feeling tired; however, I decided to finish my work."
    Annotated: "I was feeling tired; [however] (contrast), I decided to finish my work."
    

Evaluation

We compare the two annotation approaches by fine-tuning LLaMA models on the transformed datasets and evaluating their performance on discourse connective identification and relation classification. The evaluation notebooks analyze:

  • Precision, Recall, and F1-score for both approaches.
  • Error analysis of misclassified connectives.
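
For the Numbered Tagging approach, the metric computation reduces to comparing gold and predicted token labels. A minimal sketch using scikit-learn, with illustrative label values (the notebooks may aggregate scores differently):

```python
# Minimal sketch of the metric computation, assuming gold and predicted
# token labels are already aligned. Label values are illustrative.
from sklearn.metrics import precision_recall_fscore_support

gold = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]
pred = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]

# Score only connective labels, ignoring the majority class 0.
p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=[3], average="micro", zero_division=0
)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```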

Citation

If you use this code in your research, please consider citing:

Contact

For any questions regarding this repository, please feel free to reach out or open an issue.
