
DeCAP: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models (Bae et al., 2025)

Our paper was accepted to NAACL 2025.
Full paper is available here: https://aclanthology.org/2025.naacl-long.624/

Abstract: While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and to prevent bias propagation in the answers.
To address this, we propose DeCAP, a method for debiasing LLMs using Context-Adaptive Prompt Generation. DeCAP leverages Question Ambiguity Detection to take appropriate debiasing actions based on the context, and Neutral Answer Guidance Generation to guide the LLMs to make objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our experiments across eight LLMs show that DeCAP achieves state-of-the-art zero-shot debiased QA performance. This demonstrates DeCAP's efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings.

Setup

Step 1: Environment Setup

  • Our code runs on CUDA Version 12.7 with a GeForce RTX 3090 (24GB).
git clone https://github.com/BaeSuyoung/DeCAP.git
conda create -n decap python=3.9
conda activate decap
pip install -r requirements.txt
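  • (Optional) A quick sanity check that the GPU and CUDA driver are visible before running the scripts below. This is only a suggestion, not part of the original pipeline, and assumes PyTorch is installed by requirements.txt.
# check the GPU and reported CUDA version
nvidia-smi
# should print True if the GPU is usable from Python
python -c "import torch; print(torch.cuda.is_available())"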

Step 2: Prompt Generation

  • This step generates (1) the prefix instruction and (2) the next-answer guidance.
  • The results are saved in dataset/bbq/ours-total.csv and dataset/unqover/ours-total.csv.
  • You can also generate baseline prompts by setting exp_type to base, retrieved, or random (see the example command after the script below).
cd model
bash prompt_generation.sh
  • Ours (DeCAP)
## prompt_generation.sh

# bbq dataset
for exp_type in ours
do
    for model in llama3_8B_instruct
    do
        for dataset_name in bbq
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/prompt_generation.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --generation_model $model \
                        --sample_num 100 \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done

# unqover dataset
for exp_type in ours
do
    for model in llama3_8B_instruct
    do
        for dataset_name in unqover
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/prompt_generation.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --generation_model $model \
                        --sample_num 800 \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done
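  • Baseline prompts (base, retrieved, random): a minimal single-run sketch that mirrors the flags used in prompt_generation.sh above; only exp_type changes, all other arguments are unchanged from the script.
# example: "retrieved" baseline prompts for the bbq dataset
CUDA_VISIBLE_DEVICES=0 python src/prompt_generation.py \
    --experiment_type retrieved \
    --dataset_name bbq \
    --generation_model llama3_8B_instruct \
    --sample_num 100 \
    --batch_size 32 \
    --seed 77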

Step 3: Prediction

  • You can adaptively choose the experiment type (exp_type), evaluation model (model), and dataset name (dataset_name).
  • This process is run three times with different seeds (e.g., 77, 78, 79), and the average is taken as the final result (see the seed-loop sketch after the script below).
cd model
bash inference_gpu.sh
  • Ours (DeCAP)
## inference_gpu.sh

# bbq dataset
for exp_type in ours
do
    for model in flan_t5_11B flan_t5_11B llama2_7B llama2_7B_chat llama2_13B llama2_13B_chat llama3_8B llama3_8B_instruct
    do
        for dataset_name in bbq
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/inference_gpu.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --model_name $model \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done


# unqover dataset
for exp_type in ours
do
    for model in flan_t5_11B flan_t5_11B llama2_7B llama2_7B_chat llama2_13B llama2_13B_chat llama3_8B llama3_8B_instruct
    do
        for dataset_name in unqover
        do
            command="CUDA_VISIBLE_DEVICES=0 python src/inference_gpu.py \
                        --experiment_type $exp_type \
                        --dataset_name $dataset_name \
                        --model_name $model \
                        --batch_size 32 \
                        --seed 77"
            echo $command
            eval $command
        done
    done
done
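  • Three-seed runs: inference_gpu.sh above uses only --seed 77; the sketch below shows one way to reproduce the three-seed averaging described in Step 3, assuming the script accepts arbitrary integer seeds. Averaging the three result files is left to the user.
# sketch: repeat the bbq run with seeds 77, 78, 79
for seed in 77 78 79
do
    CUDA_VISIBLE_DEVICES=0 python src/inference_gpu.py \
        --experiment_type ours \
        --dataset_name bbq \
        --model_name llama3_8B_instruct \
        --batch_size 32 \
        --seed $seed
done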

  • To run inference with other LLMs, add their model tags to MODEL_CARD in prompt.py.
  • The results are saved in the result/ folder.

Citation

@inproceedings{bae-etal-2025-decap,
    title = "{D}e{CAP}: Context-Adaptive Prompt Generation for Debiasing Zero-shot Question Answering in Large Language Models",
    author = "Bae, Suyoung  and
      Choi, YunSeok  and
      Lee, Jee-Hyong",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.624/",
    doi = "10.18653/v1/2025.naacl-long.624",
    pages = "12555--12574",
    ISBN = "979-8-89176-189-6",
    abstract = "While Large Language Models (LLMs) excel in zero-shot Question Answering (QA), they tend to expose biases in their internal knowledge when faced with socially sensitive questions, leading to a degradation in performance. Existing zero-shot methods are efficient but fail to consider context and prevent bias propagation in the answers. To address this, we propose *DeCAP*, a method for debiasing LLMs using Context-Adaptive Prompt Generation. *DeCAP* leverages a *Question Ambiguity Detection* to take appropriate debiasing actions based on the context and a *Neutral Answer Guidance Generation* to suppress the LLMs make objective judgments about the context, minimizing the propagation of bias from their internal knowledge. Our various experiments across eight LLMs show that *DeCAP* achieves state-of-the-art zero-shot debiased QA performance. This demonstrates *DeCAP*{'}s efficacy in enhancing the fairness and accuracy of LLMs in diverse QA settings."
}
