Welcome to the Official Repository of Passage Injection!
This repository contains the code, datasets, and models used in our paper: Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation.
Passage Injection is a simple yet effective method that explicitly incorporates retrieved passages into LLMs' reasoning process to enhance robustness against noisy information and improve RAG performance.
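To make the idea concrete, below is a minimal, hypothetical sketch of the prompt construction: the retrieved passages are injected into the model's reasoning segment rather than only into the user prompt. The function name and the template are illustrative assumptions, not the exact ones used in this repository.

def build_injected_prompt(question: str, passages: list[str]) -> str:
    # Illustrative only: number the passages and inject them at the start
    # of the (assumed) <think> reasoning segment, so the model reasons
    # over the external knowledge explicitly before answering.
    context = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    return (
        f"Question: {question}\n"
        "<think>\n"
        "Let me first read the retrieved passages.\n"
        f"{context}\n"
    )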
conda create -n passage_injection python=3.11.2
conda activate passage_injection
pip install vllm==0.8.5
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
You can directly use our processed data files in the datasets/ folder, which contain the top-10 retrieved passages for each question.
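For example, loading one of these files might look like the following; the file name and the field names are assumptions for illustration, so check the files in datasets/ for the exact schema.

import json

# Hypothetical example of reading a processed data file; the path and
# the "question"/"passages" field names are assumptions.
with open("datasets/popqa.json") as f:
    data = json.load(f)

sample = data[0]
print(sample["question"])
for passage in sample["passages"]:  # the top-10 retrieved passages
    print(passage)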
If you want to retrieve passages yourself, please follow the steps below (adapted from PRAG).
PopQA
Download the PopQA dataset from its repository https://github.com/AlexTMallen/adaptive-retrieval/blob/main/data/popQA.tsv, and put the file popQA.tsv into the folder data/popqa.
ComplexWebQuestions
Download the ComplexWebQuestions dataset from its repository https://www.dropbox.com/scl/fo/nqujvpg2gc4y0ozkw3wgr/AOzjVEsdUhv2Fx2pamfJlSw?rlkey=746t7xehfqxf1zr867nxiq8aq&e=1, and put the file ComplexWebQuestions_dev.json into the folder data/complexwebquestions.
2WikiMultihopQA
Download the 2WikiMultihopQA dataset from its repository https://www.dropbox.com/s/ms2m13252h6xubs/data_ids_april7.zip?e=1, unzip it, and move the folder to data/2wikimultihopqa.
HotpotQA
Download the HotpotQA dataset with the following commands:
mkdir -p data/hotpotqa
wget -P data/hotpotqa/ http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_distractor_v1.json
Download the Wikipedia dump from the DPR repository using the following commands:
mkdir -p data/dpr
wget -O data/dpr/psgs_w100.tsv.gz https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
pushd data/dpr
gzip -d psgs_w100.tsv.gz
popd
Use Elasticsearch to index the Wikipedia dump:
cd data
wget -O elasticsearch-8.15.0.tar.gz https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.15.0-linux-x86_64.tar.gz  # download Elasticsearch
tar zxvf elasticsearch-8.15.0.tar.gz
rm elasticsearch-8.15.0.tar.gz
cd elasticsearch-8.15.0
nohup bin/elasticsearch &  # run Elasticsearch in the background
cd ../..
python prep_elastic.py --data_path data/dpr/psgs_w100.tsv --index_name wiki  # build the index
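For reference, the indexing step performed by prep_elastic.py looks roughly like the sketch below, written against the official elasticsearch Python client; the field names ("title", "txt") are assumptions, so consult prep_elastic.py for the actual mapping.

import csv
from elasticsearch import Elasticsearch, helpers

# Rough sketch of bulk-indexing the DPR dump (TSV columns: id, text, title)
# into a BM25-searchable Elasticsearch index; field names are assumptions.
es = Elasticsearch("http://localhost:9200")

def passage_actions(path: str, index_name: str):
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for pid, text, title in reader:
            yield {"_index": index_name, "_id": pid,
                   "_source": {"title": title, "txt": text}}

helpers.bulk(es, passage_actions("data/dpr/psgs_w100.tsv", "wiki"))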
Run the following commands to retrieve passages for each dataset:
python src/prepare.py --dataset popqa --topk 10
python src/prepare.py --dataset complexwebquestions --topk 10
python src/prepare.py --dataset 2wikimultihopqa --topk 10
python src/prepare.py --dataset hotpotqa --topk 10
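Under the hood, each retrieval is a plain BM25 match query against the wiki index. Here is a sketch of the kind of query prepare.py presumably issues; the "txt" field name follows the indexing sketch above and is likewise an assumption.

from elasticsearch import Elasticsearch

# Hypothetical top-k BM25 lookup; see src/prepare.py for the real query.
es = Elasticsearch("http://localhost:9200")

def retrieve(question: str, topk: int = 10) -> list[str]:
    resp = es.search(index="wiki", query={"match": {"txt": question}}, size=topk)
    return [hit["_source"]["txt"] for hit in resp["hits"]["hits"]]

print(retrieve("Who wrote Hamlet?"))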
The following commands evaluate the performance of Passage Injection and other RAG baselines using top-k retrieved passages. Models should be placed in the models/ directory.
# generate predictions for multiple RAG methods
python src/inference.py --model_name Qwen3-32B --topk 5
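inference.py presumably serves the models with vLLM (installed above); a minimal, self-contained generation sketch follows, where the model path under models/ and the sampling values are assumptions.

from vllm import LLM, SamplingParams

# Minimal vLLM generation sketch; the actual prompts, sampling settings,
# and method-specific logic live in src/inference.py.
llm = LLM(model="models/Qwen3-32B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
for out in llm.generate(["Question: Who wrote Hamlet?\nAnswer:"], params):
    print(out.outputs[0].text)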
# calculate metrics for the predictions
python src/evaluate.py --model_name Qwen3-32B --topk 5
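The standard open-domain QA metrics are exact match (EM) and token-level F1; whether evaluate.py uses exactly the SQuAD-style normalization below is an assumption, but a typical implementation looks like this:

import re
import string
from collections import Counter

def normalize(s: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation and articles,
    # collapse whitespace.
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)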
Below are commands for additional experiments. The --further_type argument controls the type of injected passages:
random_noise: inject random irrelevant passages
cf_noise: inject counterfactual noisy passages
gold: inject gold (ground-truth) passages
# generate predictions with random noise
python src/infer_further.py --model_name Qwen3-32B --further_type random_noise
# calculate metrics for the predictions
python src/evaluate.py --model_name Qwen3-32B --further_type random_noise
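As a reading aid, the three --further_type settings could be realized along the lines of the hypothetical sketch below; every name here is illustrative, and the actual passage construction is presumably handled inside src/infer_further.py.

import random

def build_injected_passages(further_type: str, retrieved: list[str],
                            gold: list[str], noise_pool: list[str],
                            counterfactual: list[str]) -> list[str]:
    if further_type == "random_noise":
        # random irrelevant passages sampled from an unrelated pool
        return random.sample(noise_pool, k=len(retrieved))
    if further_type == "cf_noise":
        # passages rewritten to support a counterfactual (wrong) answer
        return counterfactual
    if further_type == "gold":
        # ground-truth passages that actually contain the answer
        return gold
    raise ValueError(f"unknown further_type: {further_type}")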