
Official code for the paper Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation.


mh-tang/Passage-Injection


Passage Injection

Welcome to the Official Repository of Passage Injection!

This repository contains the code, datasets, and models used in our paper: Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation.

Passage Injection is a simple yet effective method that explicitly incorporates retrieved passages into LLMs' reasoning process to enhance robustness against noisy information and improve RAG performance.
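To make the contrast with standard RAG concrete, here is a minimal sketch of where the passages end up in each setup. The template wording, the function names, and the use of a `<think>` tag to open the reasoning segment are illustrative assumptions, not the prompts this repository actually uses.

```python
from typing import List, Tuple


def format_passages(passages: List[str]) -> str:
    """Number the passages so each one can be referenced individually."""
    return "\n".join(f"Passage {i}: {p}" for i, p in enumerate(passages, start=1))


def vanilla_rag_prompt(question: str, passages: List[str]) -> Tuple[str, str]:
    """Baseline: passages go in the user turn; the reasoning segment starts empty."""
    user = f"{format_passages(passages)}\n\nQuestion: {question}"
    reasoning_prefix = "<think>\n"
    return user, reasoning_prefix


def passage_injection_prompt(question: str, passages: List[str]) -> Tuple[str, str]:
    """Injection: the user turn holds only the question, and the passages are
    written into the start of the model's reasoning segment."""
    user = f"Question: {question}"
    # Hypothetical wording; the real template may differ.
    reasoning_prefix = (
        "<think>\nI have these retrieved passages to consider:\n"
        f"{format_passages(passages)}\n"
    )
    return user, reasoning_prefix
```

In both cases the model continues generating from the returned reasoning prefix; the only difference in this sketch is whether the passages sit in the user turn or inside the reasoning segment.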

Overall Method

Reproduce Paper Results

Install Environment

conda create -n passage_injection python=3.11.2
conda activate passage_injection
pip install vllm==0.8.5
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.2.post1/flash_attn-2.7.2.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

Prepare Data

You can directly use our processed data files in the datasets/ folder, which contain the top-10 retrieved passages for each question.

If you want to retrieve passages yourself, follow the steps below (adapted from PRAG).

Download Datasets

PopQA

Download the PopQA dataset from its repository (https://github.com/AlexTMallen/adaptive-retrieval/blob/main/data/popQA.tsv) and put the file popQA.tsv into the data/popqa folder.
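The link above points to GitHub's file viewer, so fetching it directly with a script returns an HTML page rather than the TSV. A minimal sketch of a scripted download, assuming GitHub's usual `raw.githubusercontent.com` layout and a branch name without slashes (the helper names `github_blob_to_raw` and `download_popqa` are hypothetical, not part of this repository):

```python
import os
import urllib.request
from urllib.parse import urlsplit, urlunsplit

POPQA_BLOB_URL = (
    "https://github.com/AlexTMallen/adaptive-retrieval/blob/main/data/popQA.tsv"
)


def github_blob_to_raw(url: str) -> str:
    """Convert a github.com 'blob' URL into its raw-content URL."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _, ref, *path = segments
    raw_path = "/".join([owner, repo, ref, *path])
    return urlunsplit(("https", "raw.githubusercontent.com", "/" + raw_path, "", ""))


def download_popqa(dest_dir: str = "data/popqa") -> str:
    """Download popQA.tsv into dest_dir and return the local file path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, "popQA.tsv")
    urllib.request.urlretrieve(github_blob_to_raw(POPQA_BLOB_URL), dest)
    return dest
```

Calling `download_popqa()` from the repository root should leave the file at data/popqa/popQA.tsv, the location given above.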

ComplexWebQuestions

Download the ComplexWebQuestions dataset from https://www.dropbox.com/scl/fo/nqujvpg2gc4y0ozkw3wgr/AOzjVEsdUhv2Fx2pamfJlSw?rlkey=746t7xehfqxf1zr867nxiq8aq&e=1 and put the file ComplexWebQuestions_dev.json into the data/complexwebquestions folder.

2WikiMultihopQA

Download the 2WikiMultihopQA dataset from https://www.dropbox.com/s/ms2m13252h6xubs/data_ids_april7.zip?e=1. Unzip it and move the extracted folder to data/2wikimultihopqa.

HotpotQA

Download the HotpotQA dataset with the following command:

mkdir -p data/hotpotqa
wget -P data/hotpotqa/ http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_distractor_v1.json

Retrieve Passages

  1. Download the Wikipedia dump from the DPR repository using the following commands:

    mkdir -p data/dpr
    wget -O data/dpr/psgs_w100.tsv.gz https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
    pushd data/dpr
    gzip -d psgs_w100.tsv.gz
    popd
  2. Use Elasticsearch to index the Wikipedia dump:

    cd data
    wget -O elasticsearch-8.15.0.tar.gz https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.15.0-linux-x86_64.tar.gz  # download Elasticsearch
    tar zxvf elasticsearch-8.15.0.tar.gz
    rm elasticsearch-8.15.0.tar.gz 
    cd elasticsearch-8.15.0
    nohup bin/elasticsearch &  # run Elasticsearch in background
    cd ../..
    python prep_elastic.py --data_path data/dpr/psgs_w100.tsv --index_name wiki  # build index
  3. Run the following commands to retrieve passages for each dataset:

    python src/prepare.py --dataset popqa --topk 10
    python src/prepare.py --dataset complexwebquestions --topk 10
    python src/prepare.py --dataset 2wikimultihopqa --topk 10
    python src/prepare.py --dataset hotpotqa --topk 10
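Building the index in step 2 takes a while, so it may be worth checking first that the dump from step 1 decompressed correctly. The sketch below assumes the DPR dump is a tab-separated file whose header row is `id`, `text`, `title`; the helper `peek_passages` is hypothetical and not part of this repository.

```python
import csv
from itertools import islice
from typing import Dict, List

# Assumed header of the DPR psgs_w100.tsv dump.
EXPECTED_COLUMNS = ["id", "text", "title"]


def peek_passages(path: str, n: int = 3) -> List[Dict[str, str]]:
    """Validate the header and return the first n passages without loading the file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        return list(islice(reader, n))
```

For example, `peek_passages("data/dpr/psgs_w100.tsv")` reads only the first few rows, so it returns quickly even though the full file is large.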

Run Passage Injection

The following commands evaluate the performance of Passage Injection and other RAG baselines using top-k retrieved passages. Models should be placed in the models/ directory.

# generate predictions for multiple RAG methods
python src/inference.py --model_name Qwen3-32B --topk 5

# calculate metrics for the predictions
python src/evaluate.py --model_name Qwen3-32B --topk 5

Below are commands for additional experiments. The --further_type argument controls the type of injected passages:

  • random_noise: inject random irrelevant passages
  • cf_noise: inject counterfactual noisy passages
  • gold: inject gold (ground-truth) passages

# generate predictions with random noise
python src/infer_further.py --model_name Qwen3-32B --further_type random_noise

# calculate metrics for the predictions
python src/evaluate.py --model_name Qwen3-32B --further_type random_noise
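To cover all three --further_type settings in one go, the two commands above can be wrapped in a small driver. This is a sketch only: it reuses the script paths and flags exactly as shown in this README, while the helper names and the `dry_run` switch are hypothetical.

```python
import subprocess
from typing import List

# The three values documented for --further_type.
FURTHER_TYPES = ["random_noise", "cf_noise", "gold"]


def build_commands(model_name: str, further_type: str) -> List[List[str]]:
    """Return the inference command followed by the evaluation command."""
    common = ["--model_name", model_name, "--further_type", further_type]
    return [
        ["python", "src/infer_further.py", *common],
        ["python", "src/evaluate.py", *common],
    ]


def run_all(model_name: str = "Qwen3-32B", dry_run: bool = True) -> None:
    """Print every command; execute them too when dry_run is False."""
    for further_type in FURTHER_TYPES:
        for cmd in build_commands(model_name, further_type):
            print(" ".join(cmd))
            if not dry_run:
                # check=True stops the sweep at the first failing step.
                subprocess.run(cmd, check=True)
```

Running `run_all()` with the default `dry_run=True` only prints the six commands, which is a cheap way to review them before starting a long GPU job.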
