This is the repository of Team se7enese at ALQAC, KSE 2024. ALQAC (Automated Legal Question Answering Competition) includes two tasks:
- Legal Document Retrieval.
- Legal Question Answering.
Member: Hoang-Bao Le.
Affiliation: Dublin City University.
For further information, please visit here.
- Pipeline:
Preprocess corpus → Retrieval model → Update training data → Pretrained model → Post-process answer
Preprocessing data:
Stage 1 (corpus)
- Remove the “\n\n” characters.
- Keep the first sentence as the topic sentence of the article that follows (see the sketch below).
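A minimal sketch of this cleaning step, assuming one raw string per article (the function and field names are illustrative, not the repository's actual code):
```python
def preprocess_article(text: str) -> dict:
    # Drop the "\n\n" separators left in the raw corpus text.
    text = text.replace("\n\n", " ").strip()
    # Treat the first sentence as the topic sentence of the article.
    first_period = text.find(".")
    topic_sentence = text if first_period == -1 else text[: first_period + 1]
    return {"topic_sentence": topic_sentence, "content": text}
```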
Stage 2 (update)
- Add the “related article” to each training sample.
- Format the input as the following instruction:
```
You are a helpful Vietnamese legal assistant with the mission of answering the question based on the given article without explanation.
### Article: {article}
### Question: {question}
{choices}
### Answer: {answer}
```
Only add the {choices} part when the question_type is “Trắc nghiệm” (multiple choice).
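As an illustration, the template can be filled in with a helper like the one below; the sample keys (question_type, question, choices, answer) are assumptions about the ALQAC JSON format, not the repository's exact schema:
```python
PROMPT = (
    "You are a helpful Vietnamese legal assistant with the mission of answering "
    "the question based on the given article without explanation.\n"
    "### Article: {article}\n"
    "### Question: {question}\n"
    "{choices}"
    "### Answer: {answer}"
)

def build_prompt(sample: dict, article: str) -> str:
    # The {choices} block is only included for multiple-choice questions.
    choices = ""
    if sample["question_type"] == "Trắc nghiệm":
        choices = "\n".join(sample["choices"]) + "\n"
    return PROMPT.format(
        article=article,
        question=sample["question"],
        choices=choices,
        answer=sample.get("answer", ""),
    )
```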
Stage 3 (final answer)
- Keep only the first word of the generated answer: “Đúng” or “Sai” for “Đúng/Sai” (true/false) questions, and the option letter (A, B, C, or D) for multiple-choice questions.
- For “Tự luận” (free-text) questions, we manually check the model's generated answer and fill in the correct answer (see the sketch below).
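A hedged sketch of this post-processing (the rule base applied to special cases in the final submission is not reproduced here):
```python
import re

def extract_answer(generated: str, question_type: str) -> str:
    # Keep only the leading answer token from the model output.
    generated = generated.strip()
    if question_type == "Đúng/Sai":        # true/false
        m = re.match(r"Đúng|Sai", generated)
        return m.group(0) if m else generated
    if question_type == "Trắc nghiệm":     # multiple choice
        m = re.match(r"[ABCD]\b", generated)
        return m.group(0) if m else generated
    # "Tự luận" (free-text) answers are checked and corrected manually.
    return generated
```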
Retrieval model:
We implemented the retrieval model in three different ways:
- The first approach combines BM25 with an attention mechanism.
- The second approach combines BM25 with an attention mechanism and CNN layers.
- The third approach uses the recent bm25s library (see the sketch below).
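For the third approach, a minimal sketch with the bm25s library looks roughly as follows; the corpus contents, tokenization, and the value of k are placeholders rather than the repository's exact settings:
```python
import bm25s

# Placeholder corpus: one string per legal article after Stage 1 preprocessing.
corpus = [
    "Điều 1. Phạm vi điều chỉnh ...",
    "Điều 2. Đối tượng áp dụng ...",
]

retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

query = "Luật này áp dụng cho đối tượng nào?"
doc_ids, scores = retriever.retrieve(bm25s.tokenize(query), k=1)
print(doc_ids[0], scores[0])  # indices and BM25 scores of the top-k articles
```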
Pretrained model:
- To enlarge the training data, we concatenate the two train sets into one, attach the related article to each sample, and design a prompt that packs all the useful information into a single model input.
- We also apply LoRA for fine-tuning (the configuration is sketched after the table below). We fine-tune the model for 1, 3, and 5 epochs and report the results in the following section.
- Finally, after the inference stage, we manually post-process the generated answers to remove irrelevant parts and adjust them where needed.
For the training data, we combine the train and unverified train sets to extend their diversity (named “Total train” in the table below).

| Split | Đúng/Sai (true/false) | Trắc nghiệm (multiple choice) | Tự luận (free-text) |
|---|---|---|---|
| Train | 50 | 40 | 10 |
| Unverified train | 208 | 173 | 49 |
| Total train | 258 | 213 | 59 |
| Public test | 132 | 76 | 0 |
| Private test | 48 | 43 | 9 |
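For reference, the LoRA flags in scripts/lora_finetune.sh (shown further below) correspond roughly to the following PEFT configuration; the target modules are an assumption, since the script does not list them:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("./Meta-Llama-3-8B-Instruct")

lora_config = LoraConfig(
    r=16,                # --lora_r
    lora_alpha=16,       # --lora_alpha
    lora_dropout=0.05,   # --lora_dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed; not listed in the script
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```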
- For bm25s and the non-fine-tuned LLMs, please run the bash script `scripts/run.sh`:
```bash
#MODEL="./Vistral-7B-Chat"
#MODEL="./Meta-Llama-3-8B-Instruct"
#MODEL="NousResearch/Llama-2-7b-chat-hf"
#MODEL="chillies/vistral-legal-chat-q4"
MODEL="chillies/vinallama-legal-chat"
DATAPATH="./data/public_test.json"

python3 main.py \
    --model_id $MODEL \
    --file $DATAPATH
```
However, this script runs on a single GPU only. If you want to use multiple GPUs, we highly recommend using torchrun instead of deepspeed.
- For fine-tuning the LLMs, please run the bash script `scripts/lora_finetune.sh`:
```bash
BASE_DIR=./
MODEL=Meta-Llama-3-8B-Instruct
CONFIG=${BASE_DIR}/scripts/zero3_offload.json
OUTDIR=${BASE_DIR}/ckpt/${MODEL}
TRAIN_FILE=${BASE_DIR}/data/train_updated.json
BATCHSIZE=4
EPOCH=5

export CUDA_VISIBLE_DEVICES=0,1

torchrun --nnodes=1 --nproc_per_node=2 --master_port=25035 \
    ${BASE_DIR}/train.py \
    --model_name_or_path ./${MODEL} \
    --data_path ${TRAIN_FILE} \
    --lora_enable True \
    --lora_r 16 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --dataloader_num_workers 8 \
    --output_dir ${OUTDIR}_lora_$EPOCH-epo_1.0 \
    --per_device_train_batch_size ${BATCHSIZE} \
    --gradient_accumulation_steps 1 \
    --num_train_epochs $EPOCH \
    --fp16 False \
    --save_strategy "steps" \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --logging_dir "$OUTDIR" \
    --report_to wandb \
    --run_name LoraFT_Llama3_$EPOCH-epo_1.0
```
Document Retrieval

| Model | k | Precision | Recall | F2 |
|---|---|---|---|---|
| bm25 attention | 0.84 | 71.15 | 69.71 | 69.87 |
| bm25 attention | 1.5 | 62.98 | 61.78 | 61.91 |
| bm25 attention | 0 | 63.46 | 62.26 | 62.39 |
| bm25 attention cnn | 0.5 | 63.46 | x | x |
| bm25s | x | 72.16 | x | x |
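Here Precision and Recall are measured on the retrieved article set, and F2 is assumed to be the standard F-beta score with β = 2 (recall weighted higher than precision), averaged over queries:
```latex
F_2 = \frac{(1 + 2^2)\,\mathrm{Precision}\cdot\mathrm{Recall}}{2^2\cdot\mathrm{Precision} + \mathrm{Recall}}
    = \frac{5\,\mathrm{Precision}\cdot\mathrm{Recall}}{4\,\mathrm{Precision} + \mathrm{Recall}}
```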
Fine-tuned LLaMA3 (version 8B Instruct) and Inference

| Epoch(s) | Accuracy (%) | Note |
|---|---|---|
| 1 | 75 | Keep the first word |
| 3 | 81.73 | Keep the first word |
| 5 | 84.14 | + rule-based handling for special cases |
Task 1

| File name (.json) | Description |
|---|---|
| bm25_attention | Combines the BM25 score with an attention score between the corpus and the query, then ranks the combined score to choose the top-k articles. |
| bm25_attention_cnn | Same combination of BM25 and attention scores, but with CNN layers inserted into the attention process before ranking to choose the top-k articles. |
| bm25s | Applies bm25s without any additional components. |
Task 2

| File name (.json) | Description |
|---|---|
| 1epo | Fine-tuned with a LoRA adapter on the Total train set for 1 epoch. |
| 3epo | Fine-tuned with a LoRA adapter on the Total train set for 3 epochs. |
| 5epo | Fine-tuned with a LoRA adapter on the Total train set for 5 epochs. |
All three models above are fine-tuned with the same hyperparameters on a single A100 GPU.