This repository contains the code and data preprocessing scripts for our Master’s thesis project, which focuses on explicit discourse parsing using fine-tuned LLaMA models. The primary goal is to detect discourse connectives and classify their relations using two different annotation approaches: Numbered Tagging and Bracket Annotation.
Our methodology involves transforming the Penn Discourse TreeBank (PDTB) dataset into structured formats suitable for model training, followed by fine-tuning LLaMA models on these transformed datasets and evaluating their performance.
Preprocessing_PDTB_into_Numbered_Tagging-Bracket_Annotation.ipynb
- This notebook processes raw PDTB text files into Numbered Tagging and Bracket Annotation formats, making them suitable for fine-tuning the LLaMA models.
- Also contains exploratory processing of the PCC German discourse corpus, the FDTB French discourse corpus, and the Czech discourse corpus.
LLamafactory_evaluating_model_prediction.ipynb
- This notebook evaluates the performance of the LLaMA model fine-tuned via LLaMA-Factory using the Bracket Annotation approach.
Unllama_Model_Evaluation.ipynb
- This notebook evaluates the fine-tuned LS-unLLaMA model using the Numbered Tagging approach.
Czech_data_folder/
- Directory containing the raw Czech discourse corpus files.
German_connectives/
- Directory containing the raw German discourse corpus files.
french_corpus_validated/
- Directory containing the raw French discourse corpus files.
best_model_checkpoint/
- Best checkpoint of the fine-tuned LS-unLLaMA model.
This work leverages two repositories for fine-tuning LLaMA models:
- LS-LLaMA: We fine-tuned the LS-unLLaMA model, which uses label supervision to improve performance on token classification tasks. Implementation details are based on the LS-LLaMA repository.
- LLaMA Factory: A second fine-tuning approach was carried out with the LLaMA-Factory repository, which provides flexible instruction tuning for LLaMA models (a sketch of the expected data format follows).
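Below is a minimal sketch of how a bracket-annotated sentence can be packaged as an instruction-tuning record in the alpaca-style JSON format (instruction/input/output fields) that LLaMA-Factory accepts. The prompt wording and file name are illustrative assumptions, not the exact ones used in this thesis.

```python
import json

# Minimal sketch: one alpaca-style instruction-tuning record for LLaMA-Factory.
# The instruction text and output file name below are illustrative only.
record = {
    "instruction": "Mark every explicit discourse connective in the sentence "
                   "with square brackets and append its discourse relation.",
    "input": "I was feeling tired; however, I decided to finish my work.",
    "output": "I was feeling tired; [however] (contrast), I decided to finish my work.",
}

with open("bracket_annotation_train.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```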
- Numbered Tagging: each token in a sentence is mapped to 0, except for explicit discourse connectives, which receive an identifier corresponding to their discourse relation (see the sketch after the example below).
- Example:
  Sentence: "I was feeling tired; however, I decided to finish my work."
  Annotated: 0 0 0 0 3 0 0 0 0 0 0
- Here, the connective "however" is tagged with 3, the identifier for a contrast relation, while every other token is tagged 0.
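To make the mapping concrete, here is a minimal sketch of how a sentence and its connective can be turned into a numbered tag sequence. The whitespace tokenization, punctuation stripping, and the relation-to-identifier mapping are illustrative assumptions, not the exact conventions used in the preprocessing notebook.

```python
# Minimal sketch of the Numbered Tagging format.
# RELATION_IDS and the tokenization below are illustrative assumptions.
RELATION_IDS = {"contrast": 3}  # hypothetical relation-to-id mapping

def numbered_tags(sentence: str, connective: str, relation: str) -> list[int]:
    """Tag every token 0 except the explicit connective, which gets its relation id."""
    tags = []
    for token in sentence.split():
        # Strip trailing punctuation so "however," matches "however".
        if token.strip(".,;:!?").lower() == connective.lower():
            tags.append(RELATION_IDS[relation])
        else:
            tags.append(0)
    return tags

sentence = "I was feeling tired; however, I decided to finish my work."
print(numbered_tags(sentence, "however", "contrast"))
# [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]
```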
- Bracket Annotation: explicit discourse connectives are enclosed in square brackets [] and labeled with their discourse relation (a sketch follows the example below).
- Example:
Sentence: "I was feeling tired; however, I decided to finish my work." Annotated: "I was feeling tired; [however] (contrast), I decided to finish my work."
We compare both annotation approaches by fine-tuning LLaMA models on these datasets and evaluating their performance on discourse connective identification and relation classification. The evaluation notebooks analyze:
- Precision, Recall, and F1-score for both approaches (a minimal computation sketch follows this list).
- Error analysis of misclassified connectives.
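As a reference for how such token-level scores can be computed, here is a minimal sketch using scikit-learn. The actual evaluation code lives in the notebooks above; the gold/predicted sequences below are illustrative.

```python
from sklearn.metrics import precision_recall_fscore_support

# Minimal sketch of token-level scoring for the Numbered Tagging approach.
# Gold and predicted tag sequences are flattened into single lists;
# label 0 ("no connective") is excluded so scores reflect connectives only.
gold = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]
pred = [0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0]

labels = sorted({t for t in gold if t != 0})
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, labels=labels, average="micro", zero_division=0
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```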
If you use this code in your research, please consider citing:
- LS-LLaMA: Label-Supervised Fine-Tuning for LLaMA
- LLaMA-Factory: Fine-Tuning for Large Language Models
For any questions regarding this repository, please feel free to reach out or open an issue.