
DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models (Bae et al., 2025)

Our paper was accepted to NAACL 2025.
Full paper is available here: https://aclanthology.org/2025.naacl-long.624/

Abstract: While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and to prevent bias propagation in the answers.
To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP leverages Question Ambiguity Detection to take appropriate debiasing actions based on the context, and Neutral Answer Guidance Generation to guide the LLMs to make objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP's efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

Setup

Step 1: Environment Setup

  • Our code runs on CUDA Version 12.7 with a GeForce RTX 3090 (24GB).
git clone https://github.com/BaeSuyoung/DeCAP.git
conda create -n decap python=3.9
conda activate decap
pip install -r requirements.txt
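  • (Optional) A quick sanity check that the GPU and CUDA driver are visible before running the scripts below. This is only a suggestion, not part of the original pipeline, and assumes PyTorch is installed by requirements.txt.
# check the GPU and reported CUDA version
nvidia-smi
# should print True if the GPU is usable from Python
python -c "import torch; print(torch.cuda.is_available())"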

Step 2: Prompt Generation

  • This step generates (1) the prefix instruction and (2) the next-answer guidance.
  • The results are saved in dataset/bbq/ours-total.csv and dataset/unqover/ours-total.csv.
  • You can also generate baseline prompts by setting exp_type to base, retrieved, or random (see the example command after the script below).
cd model
bash prompt_generation.sh
  • Ours (DeCAP)
## prompt_generation.sh

# bbq dataset
for exp_type in ours
do
    for model in llama3_8B_instruct
    do
        for dataset_name in bbq
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/prompt_generation.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --generation_model $model \
                        --sample_num 100 \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done

# unqover dataset
for exp_type in ours
do
    for model in llama3_8B_instruct
    do
        for dataset_name in unqover
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/prompt_generation.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --generation_model $model \
                        --sample_num 800 \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done
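  • Baseline prompts (base, retrieved, random): a minimal single-run sketch that mirrors the flags used in prompt_generation.sh above; only exp_type changes, all other arguments are unchanged from the script.
# example: "retrieved" baseline prompts for the bbq dataset
CUDA_VISIBLE_DEVICES=0 python src/prompt_generation.py \
    --experiment_type retrieved \
    --dataset_name bbq \
    --generation_model llama3_8B_instruct \
    --sample_num 100 \
    --batch_size 32 \
    --seed 77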

Step 3: Prediction

  • You can adaptively choose the experiment type (exp_type), evaluation model (model), and dataset name (dataset_name).
  • This process is run three times with different seeds (e.g., 77, 78, 79), and the average is taken as the final result (see the seed-loop sketch after the script below).
cd model
bash inference_gpu.sh
  • Ours (DeCAP)
## inference_gpu.sh

# bbq dataset
for exp_type in ours
do
    for model in flan_t5_11B flan_t5_11B llama2_7B llama2_7B_chat llama2_13B llama2_13B_chat llama3_8B llama3_8B_instruct
    do
        for dataset_name in bbq
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/inference_gpu.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --model_name $model \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done


# unqover dataset
for exp_type in ours
do
    for model in flan_t5_11B flan_t5_11B llama2_7B llama2_7B_chat llama2_13B llama2_13B_chat llama3_8B llama3_8B_instruct
    do
        for dataset_name in unqover
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/inference_gpu.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --model_name $model \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done
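  • Three-seed runs: inference_gpu.sh above uses only --seed 77; the sketch below shows one way to reproduce the three-seed averaging described in Step 3, assuming the script accepts arbitrary integer seeds. Averaging the three result files is left to the user.
# sketch: repeat the bbq run with seeds 77, 78, 79
for seed in 77 78 79
do
    CUDA_VISIBLE_DEVICES=0 python src/inference_gpu.py \
        --experiment_type ours \
        --dataset_name bbq \
        --model_name llama3_8B_instruct \
        --batch_size 32 \
        --seed $seed
done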

  • To run inference with other LLMs, add their model tags to MODEL_CARD in prompt.py.
  • The results are saved in the result/ folder.

Citation

@inproceedings{bae-etal-2025-decap,
    title = "{D}e{CAP}: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models",
    author = "Bae, Suyoung  and
      Choi, YunSeok  and
      Lee, Jee-Hyong",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.624/",
    doi = "10.18653/v1/2025.naacl-long.624",
    pages = "12555--12574",
    ISBN = "979-8-89176-189-6",
    abstract = "While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and prevent bias propagation in the answers. To address this, we propose *DeCAP*, a method for debiasing LLMs using Context-Adaptive Prompt Generation. *DeCAP* leverages a *Question Ambiguity Detection* to take appropriate debiasing actions based on the context and a *Neutral Answer Guidance Generation* to suppress the LLMs make objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our various experiments across eight LLMs show that *DeCAP* achieves state-of-the-art zero-shot debiased QA performance. This demonstrates *DeCAP*{'}s efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings."
}
