ALQAC24

This is the repository of team se7enese at ALQAC 2024, held at KSE 2024. ALQAC, the Automated Legal Question Answering Competition, includes two tasks:

  • Legal Document Retrieval.
  • Legal Question Answering.

Member: Hoang-Bao Le.

Affiliation: Dublin City University.

For further information, please visit here.

Methodology

  1. Pipeline:

Preprocessing corpus → Retrieval model → Updating training data → Fine-tuning pretrained model → Postprocessing answers

  2. Preprocessing data:

    Stage 1 (corpus)

    • Remove the “\n\n” characters.
    • Keep the first sentence as the topic sentence for the article that follows it.
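    A minimal sketch of this Stage 1 step (the function name and the sentence split are our illustration, not the exact code in this repository):

    def preprocess_article(text: str) -> dict:
        # Stage 1: remove the "\n\n" separators inside an article.
        flat = text.replace("\n\n", " ").strip()
        # Treat the first sentence as the topic sentence for the article.
        topic, _, body = flat.partition(". ")
        return {"topic_sentence": topic.rstrip(".") + ".", "body": body}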

    Stage 2 (update)

    • Add the related article to each training sample.
    • Format the input with the following instruction template:

    You are a helpful Vietnamese legal assistant with the mission of answering the question based on the given article without explanation.

    ### Article: {article}

    ### Question: {question}

    {choices}

    ### Answer: {answer}

    The {choices} part is added only when the question_type is “Trắc nghiệm” (multiple choice); a sketch of assembling this prompt follows.
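    A minimal sketch of building this instruction input (the sample keys such as article, question, choices, and answer are assumptions based on the template above):

    PROMPT = (
        "You are a helpful Vietnamese legal assistant with the mission of "
        "answering the question based on the given article without explanation.\n\n"
        "### Article: {article}\n\n"
        "### Question: {question}\n\n"
        "{choices}"
        "### Answer: {answer}"
    )

    def build_input(sample: dict) -> str:
        # The {choices} block is included only for multiple-choice questions.
        choices = ""
        if sample["question_type"] == "Trắc nghiệm":
            choices = "\n".join(sample["choices"]) + "\n\n"
        return PROMPT.format(
            article=sample["article"],
            question=sample["question"],
            choices=choices,
            answer=sample.get("answer", ""),
        )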

    Stage 3 (final answer)

    • Keep the first word of the generated answer: “Đúng” or “Sai” for “Đúng/Sai” (true/false) questions, and the option letter (A, B, C, or D) for multiple-choice questions.
    • For “Tự luận” (essay) questions, we manually check the model’s generated answer and fill in the correct answer.
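    A sketch of the first-word extraction for the two automatic cases (the helper name and exact string handling are our illustration):

    import re

    def extract_answer(generated: str, question_type: str) -> str:
        tokens = generated.strip().split()
        if not tokens:
            return ""
        # Keep only the first meaningful token of the model output.
        first = tokens[0].strip(".,:")
        if question_type == "Đúng/Sai":
            return first if first in ("Đúng", "Sai") else ""
        if question_type == "Trắc nghiệm":
            match = re.match(r"[ABCD]", first)
            return match.group(0) if match else ""
        return generated  # "Tự luận" answers are checked manually.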
  3. Retrieval model:

    We implemented the retrieval model in three different ways:

    • The first approach combines BM25 with attention scores.
    • The second approach combines BM25 with attention scores and CNN layers.
    • The third approach uses the new bm25s library (see the sketch below).
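    For the third approach, a minimal sketch of bm25s usage, following the library's documented quickstart rather than our exact script (the corpus, query, and k are placeholders):

    import bm25s

    corpus = ["Điều 1. ...", "Điều 2. ..."]  # preprocessed legal articles
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(corpus))

    query = "Câu hỏi pháp lý ..."
    # Retrieve the top-k document ids and their BM25 scores.
    results, scores = retriever.retrieve(bm25s.tokenize(query), k=2)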
  4. Pretrained model:

    • To compensate for the small dataset, we concatenate the two train sets into a single set, add the related article to each sample, and use the prompt above to pack all useful information into one model input.
    • We also apply LoRA for fine-tuning (see the sketch below). We fine-tune the model for 1, 3, and 5 epochs and report the results in the following section.
    • Finally, after the inference stage, we manually post-process the generated answers to remove irrelevant parts and adjust them where needed.
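    The LoRA hyperparameters below mirror the training script in the How to run section; train.py presumably wires them through peft along these lines (leaving target_modules to peft's defaults is our assumption):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("./Meta-Llama-3-8B-Instruct")
    lora_config = LoraConfig(
        task_type="CAUSAL_LM",
        r=16,              # --lora_r 16
        lora_alpha=16,     # --lora_alpha 16
        lora_dropout=0.05, # --lora_dropout 0.05
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapter weights are trainable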

Data

For the training data, we combine the train and unverified train sets to increase diversity (reported as “Total train” in the table below). The question types are Đúng/Sai (true/false), Trắc nghiệm (multiple choice), and Tự luận (essay).

Split              Đúng/Sai   Trắc nghiệm   Tự luận
Train                    50            40        10
Unverified train        208           173        49
Total train             258           213        59
Public test             132            76         0
Private test             48            43         9
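Building the combined set could look like the sketch below (the input file names are assumptions; train_updated.json matches the TRAIN_FILE used by the training script later):

import json

with open("data/train.json") as f:
    train = json.load(f)
with open("data/unverified_train.json") as f:
    unverified = json.load(f)

# "Total train": 100 + 430 = 530 samples across the three question types.
total_train = train + unverified
with open("data/train_updated.json", "w") as f:
    json.dump(total_train, f, ensure_ascii=False, indent=2)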

How to run?

  • For bm25s and non-fine-tuned LLMs, run the bash script scripts/run.sh:
#MODEL="./Vistral-7B-Chat"
#MODEL="./Meta-Llama-3-8B-Instruct"
#MODEL="NousResearch/Llama-2-7b-chat-hf"
#MODEL="chillies/vistral-legal-chat-q4"
MODEL="chillies/vinallama-legal-chat"
DATAPATH="./data/public_test.json"

python3 main.py \
	--model_id $MODEL \
	--file $DATAPATH

However, this script only runs on a single GPU. If you aim to use multiple GPUs, we highly recommend using torchrun instead of deepspeed:

BASE_DIR=./
MODEL=Meta-Llama-3-8B-Instruct
CONFIG=${BASE_DIR}/scripts/zero3_offload.json
OUTDIR=${BASE_DIR}/ckpt/${MODEL}
TRAIN_FILE=${BASE_DIR}/data/train_updated.json
BATCHSIZE=4
EPOCH=5

export CUDA_VISIBLE_DEVICES=0,1

torchrun --nnodes=1 --nproc_per_node=2 --master_port=25035 \
	${BASE_DIR}/train.py \
	--model_name_or_path ./${MODEL} \
	--data_path ${TRAIN_FILE} \
	--lora_enable True \
	--lora_r 16 \
	--lora_alpha 16 \
	--lora_dropout 0.05 \
	--dataloader_num_workers 8 \
	--fp16 \
	--output_dir ${OUTDIR}_lora_$EPOCH-epo_1.0 \
	--per_device_train_batch_size ${BATCHSIZE} \
	--gradient_accumulation_steps 1 \
	--num_train_epochs $EPOCH \
	--save_strategy "steps" \
	--save_total_limit 1 \
	--learning_rate 2e-5 \
	--warmup_ratio 0.03 \
	--lr_scheduler_type "cosine" \
	--logging_steps 1 \
	--logging_dir "$OUTDIR" \
	--report_to wandb \
	--run_name LoraFT_Llama3_$EPOCH-epo_1.0

Results

Document Retrieval

Model                k      Precision   Recall    F2
bm25 attention       0.84       71.15    69.71    69.87
bm25 attention       1.5        62.98    61.78    61.91
bm25 attention       0          63.46    62.26    62.39
bm25 attention cnn   0.5        63.46        x        x
bm25s                x          72.16        x        x

Fine-tuned LLaMA3 (8B Instruct) and Inference

Epoch(s)   Accuracy (%)   Note
1          75             Keep the first word
3          81.73          Keep the first word
5          84.14          Keep the first word + rule-based handling for special cases

Submissions

Task 1

File name (.json)    Description
bm25_attention       Combines the BM25 score with an attention score between the corpus and the query, then ranks the combined score to choose the top-k articles.
bm25_attention_cnn   Same combination of BM25 and attention scores, but with CNN layers applied directly in the attention step before ranking to choose the top-k.
bm25s                Applies bm25s without adding any other component.
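The tables do not pin down how the two signals in bm25_attention are mixed; one plausible reading, with the mixing weight alpha and min-max normalization as our assumptions, is:

import numpy as np

def combine_scores(bm25_scores, attn_scores, alpha=0.5, top_k=5):
    bm25 = np.asarray(bm25_scores, dtype=float)
    attn = np.asarray(attn_scores, dtype=float)
    # Min-max normalize each signal so the two scales are comparable.
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    combined = alpha * norm(bm25) + (1 - alpha) * norm(attn)
    return np.argsort(combined)[::-1][:top_k]  # indices of the top-k articles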

Task 2

File name (.json)   Description
1epo                Fine-tuned with a LoRA adapter on the Total train set for 1 epoch.
3epo                Fine-tuned with a LoRA adapter on the Total train set for 3 epochs.
5epo                Fine-tuned with a LoRA adapter on the Total train set for 5 epochs.

All three models above are fine-tuned with the same hyperparameters on a single A100 GPU.
