
The official code of our paper “RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation”


RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation


Guanting Dong, Jiajie Jin, Xiaoxi Li, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen

Gaoling School of Artificial Intelligence, Renmin University of China.
Corresponding Author

📌 Release

[05/2025] Our paper has been accepted at ACL 2025 Main conference!

[03/2025] We present a demo of RAG-Critic to showcase its efficient critic capabilities, try demo.ipynb!

[03/2025] We release our huggingface dataset 🤗RAG-Error-Critic-100K and Critic model 🤗RAG-Critic-3B

[03/2025] Code and paper are publicly available.

🔧 1. General Setup

Dependencies

  • Python 3.10.13
  • PyTorch (currently tested on version 2.5.1+cu124)
  • Transformers (version 4.47.1, unlikely to work below this version)
  • vLLM (version 0.6.6.post1)
pip install -r requirements.txt
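
If you want a quick environment sanity check before running anything, a minimal sketch like the following (mirroring the pinned versions above; adjust them to your setup) can catch mismatches early:

# sanity-check the pinned dependencies from the list above
from packaging import version  # installed alongside transformers

import torch
import transformers
import vllm

print("PyTorch:", torch.__version__)              # tested on 2.5.1+cu124
print("Transformers:", transformers.__version__)  # tested on 4.47.1
print("vLLM:", vllm.__version__)                  # tested on 0.6.6.post1

assert version.parse(transformers.__version__) >= version.parse("4.47.1"), \
    "Transformers below 4.47.1 is unlikely to work"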

📝 2. Dataset Preparation

Since the Critic Agent is built on the FlashRAG framework, you need to install FlashRAG first.

# install flashrag
pip install flashrag-dev[full] --pre
# install faiss
conda install -c pytorch faiss-cpu=1.8.0

Retrieve the top relevant Wikipedia passages using E5-base-v2 for the 9 RAG-related datasets and store them in the ./dataset_pool_retrieved_top10/${name} directory. You can find the train/dev/test splits of the preprocessed datasets with the top-5 retrieved passages here. We use ${dataset} to denote the 9 datasets ['nq', 'triviaqa', 'hotpotqa', '2wikimultihopqa', 'wikiasp', 'eli5', 'asqa', 'fever', 'wow'] in the example commands below.
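
As a quick way to inspect the prepared splits, here is a minimal loading sketch; the directory layout and the field names (question, golden_answers, retrieval_docs) are assumptions based on the retrieved-document format shown later in this README.

# minimal loading sketch; paths and field names are assumptions
import json
from pathlib import Path

DATA_ROOT = Path("./dataset_pool_retrieved_top10")
DATASETS = ["nq", "triviaqa", "hotpotqa", "2wikimultihopqa",
            "wikiasp", "eli5", "asqa", "fever", "wow"]

def load_split(name: str, split: str = "train"):
    with open(DATA_ROOT / name / f"{split}.json", encoding="utf-8") as f:
        return json.load(f)

for name in DATASETS:
    try:
        samples = load_split(name)
    except FileNotFoundError:
        print(f"{name}: split not found")
        continue
    print(f"{name}: {len(samples)} samples, "
          f"{len(samples[0].get('retrieval_docs', []))} passages in the first sample")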

✨ Hierarchical Error System Construction

We introduce our RAG error categorization system and a three-step pipeline for error response annotation.

Overview

As shown in the image below, we have a total of 7 first-tier labels, 19 second-tier labels, and over 2000 third-tier labels. Here are the details:

[Figure: overview of the hierarchical error system]
  1. Hierarchical Error System (here).
  2. Frequency error statistics of labels (here).
🔍 Click here if you want to reproduce our RAG error response mining and annotation.

Step 1: Error Response Sampling

First, please download the sampling models from Hugging Face (the 15 models listed in Appendix Table 9) and place their names in the models parameter. Then perform comprehensive response sampling on the 9 RAG-related datasets:

cd ./error_system_construction/
bash sample.sh

The output data will be saved at error_sampling_results/responses_${model}_${dataset}_train_1w.json.
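
To verify the sampling outputs, a small sketch like the one below counts the records in each file; it only assumes that each output file is a JSON list, since the internal record structure is not documented here.

# count sampled responses per model/dataset; assumes each file is a JSON list
import json
from pathlib import Path

out_dir = Path("./error_sampling_results")
for path in sorted(out_dir.glob("responses_*_train_1w.json")):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    print(f"{path.name}: {len(records)} sampled responses")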

Step 2: Critical Annotation & Tagging

  1. Critical Annotation
    Analyze the reasons for errors by running the strong supervision model (Qwen2.5-72B) on the sampled Chain-of-Thought responses from Step 1:

    cd ./error_system_construction/
    bash critic.sh

    The source data and error analysis will be saved at error_critic_results/critic_${model}_${dataset}_train_1w.json.

  2. Tagging
    Inspired by the Instag prompt template, we further annotate the RAG error analysis results with fine-grained, open-set labels:

    cd ./error_system_construction/
    bash error_tag.sh

    The sampled open-set tags will be saved at error_critic_results/critic_${model}_${dataset}_train_1w.json (a tag-pooling sketch follows below).
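
As a bridge to Step 3, the sketch below pools the annotated open-set tags into a frequency table; the per-record field name "tags" is an assumption, so adjust it to match your annotated files.

# pool open-set tags from the Step 2 outputs; the "tags" field is an assumption
import json
from collections import Counter
from pathlib import Path

tag_counter = Counter()
for path in Path("./error_critic_results").glob("critic_*_train_1w.json"):
    with open(path, encoding="utf-8") as f:
        for record in json.load(f):
            tag_counter.update(record.get("tags", []))

print(tag_counter.most_common(20))  # most frequent open-set error tags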

Step 3: Error Label Summarization

First, deduplicate and normalize the tag set following the method in the document. Then aggregate the RAG error clusters using the hierarchical clustering method detailed in cluster.ipynb.

Finally, use GPT-4o together with human annotators to summarize the error clusters into higher-level labels.
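
For reference, here is a minimal clustering sketch in the spirit of cluster.ipynb; the TF-IDF embedding, the distance threshold, and the toy tag list are illustrative assumptions rather than the paper's exact settings (requires scikit-learn >= 1.2).

# illustrative hierarchical clustering over a toy tag set
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

tags = ["factual inaccuracy", "factual error in answer", "entity confusion",
        "off-topic answer", "missing key information", "irrelevant retrieval"]

embeddings = TfidfVectorizer().fit_transform(tags).toarray()
clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0,
                                     metric="cosine", linkage="average")
labels = clustering.fit_predict(embeddings)

for tag, label in zip(tags, labels):
    print(label, tag)  # tags sharing a label fall into the same error cluster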

🚗 RAG Error-Critic Alignment

We release our RAG Error-Critic SFT dataset, model weights and demo:

  • SFT Dataset: We synthesize the first fine-grained error identification dataset, 🤗RAG-Error-Critic-100K, by combining responses from 15 models across 9 RAG-related datasets with fine-grained error labels.

  • Model Weights: We release our RAG error identification model 🤗RAG-Critic-3B.

  • Demo: We release the Hugging Face inference demo of our RAG-Critic model here (a minimal transformers sketch is shown below).
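
Below is a minimal transformers inference sketch for the critic model. The repository id "RUC-NLPIR/RAG-Critic-3B" and the generation settings are assumptions on our part; please follow demo.ipynb for the official prompt template and usage.

# minimal inference sketch; the repo id below is an assumption
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RUC-NLPIR/RAG-Critic-3B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# build the prompt following the RAG-Error Bench instruction format shown below
prompt = "..."  # question + top-5 passages + model prediction + tag sets
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))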

The following shows our detailed training procedure:

  • SFT configuration (LLaMA-Factory YAML):
### model
model_name_or_path: /path/to/model_zoo/model_name

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: /path/to/deepspeed/config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: dataset_name
template: template_name
cutoff_len: 4096
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /path/to/output/directory
logging_steps: 10
save_steps: 2000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

For DPO data, please construct it from our SFT dataset and the error system settings (Section 3.2), using the previous version of LLaMA-Factory.
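
For orientation only, the snippet below shows one way a coarse-to-fine preference pair could be assembled from the SFT data; every field name here is a placeholder, and the actual construction must follow Section 3.2 and the DPO data format expected by the LLaMA-Factory version you use.

# schematic DPO pair construction; all field names are placeholders
import json

def build_pair(prompt: str, preferred_critique: str, rejected_critique: str) -> dict:
    # preferred: critique with the correct fine-grained error labels
    # rejected: critique with only coarse or wrong labels
    return {"prompt": prompt, "chosen": preferred_critique, "rejected": rejected_critique}

pairs = [build_pair("<critic instruction>", "<fine-grained critique>", "<coarse critique>")]
with open("dpo_pairs.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)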

  • Coarse-to-Fine DPO bash:
deepspeed --num_gpus 8 train_bash.py \
        --deepspeed $deepspeed_zero3_config_path \
        --stage dpo \
        --do_train \
        --model_name_or_path $MODEL_PATH \
        --dataset $dataset \
        --dataset_dir $DATA_PATH \
        --template $Template \
        --finetuning_type full \
        --output_dir $OUTPUT_PATH \
        --overwrite_cache \
        --overwrite_output_dir \
        --cutoff_len 4096 \
        --preprocessing_num_workers 1 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 2 \
        --lr_scheduler_type cosine \
        --logging_steps 10 \
        --warmup_ratio 0.1 \
        --save_steps 1000 \
        --learning_rate 5e-6 \
        --num_train_epochs 2.0 \
        --max_samples 200000 \
        --ddp_timeout 180000000 \
        --plot_loss \
        --fp16

Critic-Guided Agentic Workflow

Data Preparation

🔍 Click here to reproduce our test set construction pipeline.

First, use the Critic Agent to obtain the required correction path.

Step 1: Install Required Frameworks

Since the Critic Agent is built on the FlashRAG framework, you need to install FlashRAG first.

# install flashrag
pip install flashrag-dev[full] --pre
# install faiss
conda install -c pytorch faiss-cpu=1.8.0

Step 2: Prepare the Data

Running the Critic requires the following data:

  1. Retrieved Documents: These contain the retrieval results for each query in the test set (the same retrieval results used to generate the original answers). The storage path is: {retrieval_data_dir}/{dataset_name}/{split}.json. The format is as follows:
[
    {
        "question": "who sings does he love me with reba",
        "golden_answers": [
            "Linda Davis"
        ],
        "retrieval_docs": [
            {
                "id": "17237290",
                "contents": "\"Does He Love You\"\nDoes He Love You \"\"Does He Love You\"\" is a song written by Sandy Knox and Billy Stritch, and recorded as a duet by American country music artists Reba McEntire and Linda Davis. It was released in August 1993 as the first single from Reba's album \"\"Greatest Hits Volume Two\"\". It is one of country music's several songs about a love triangle. \"\"Does He Love You\"\" was written in 1982 by Billy Stritch. He recorded it with a trio in which he performed at the time, because he wanted a song that could be sung by the other two members"
            },
            {
                "id": "5026369",
                "contents": "\"Linda Davis\"\nLinda Davis Linda Kaye Davis (born November 26, 1962) is an American country music singer. Before beginning a career as a solo artist, she had three minor country singles in the charts as one half of the duo Skip & Linda. In her solo career, Davis has recorded five studio albums for major record labels and more than 15 singles. Her highest chart entry is \"\"Does He Love You\"\", her 1993 duet with Reba McEntire, which reached number one on the \"\"Billboard\"\" country charts and won both singers the Grammy for Best Country Vocal Collaboration. Her highest solo chart position"
            }
        ]
    }
]
  2. The raw answers file generated by the model, stored at: {previous_answer_data_dir}/responses_{model_name}_{dataset_name}_{split}_100.json
  3. The error analysis file annotated by the Critic model, stored at: {error_data_dir}/errordata_{dataset_name}_{model_name}_{split}.json

To generate the last two files, refer to the steps for generating the models' raw answers and running the Critic.
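
Before launching the agent, a small check like the following (with placeholder directories) verifies that all three inputs exist for a given dataset, model, and split:

# check that the three required inputs exist; directory variables are placeholders
from pathlib import Path

retrieval_data_dir = Path("/path/to/retrieval_data")
previous_answer_data_dir = Path("/path/to/previous_answers")
error_data_dir = Path("/path/to/error_data")

dataset_name, model_name, split = "nq", "your_model", "test"

required = [
    retrieval_data_dir / dataset_name / f"{split}.json",
    previous_answer_data_dir / f"responses_{model_name}_{dataset_name}_{split}_100.json",
    error_data_dir / f"errordata_{dataset_name}_{model_name}_{split}.json",
]
for path in required:
    print(("OK      " if path.exists() else "MISSING ") + str(path))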

Step 3: Prepare the Retriever

Since the Agent needs to perform retrieval during execution, you need to download the retrieval corpus and its corresponding index. The experiment uses the Wiki-dpr-100w corpus file and the corresponding E5 index provided by FlashRAG. The download links are as follows:

Step 4: Fill in the Configuration File

After downloading the necessary files, fill in the file paths in the configuration file required by FlashRAG (myconfig.yaml). The fields that need to be filled are listed below; the other fields are filled during program execution (a filling sketch follows the list):

  • method2index
  • corpus_path
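
If you prefer to fill these fields programmatically, a minimal PyYAML sketch could look like the following; the index and corpus paths are placeholders for the files downloaded in Step 3, and the method key ("e5") is an assumption based on the retriever used in the experiments.

# fill the two required fields of myconfig.yaml; all paths are placeholders
import yaml

with open("myconfig.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f) or {}

config["method2index"] = {"e5": "/path/to/e5_flat.index"}  # assumed method key
config["corpus_path"] = "/path/to/wiki_dpr_100w.jsonl"

with open("myconfig.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(config, f, sort_keys=False)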

Usage

plan_agent.py and execute_agent.py implement the Critic's planning and execution, respectively. The running scripts are in run_exp.sh. After running, the evaluation results and intermediate variables are stored in the corresponding folder under save_dir.

You can directly run the following command to execute the Critic Agent:

cd ./critic_agent/
bash run_exp.sh

RAG-Error Bench

We introduce the RAG-Error benchmark, which evaluates prediction judgment and fine-grained error recognition.

[Figure: overview of the RAG-Error benchmark]

📊 Test Cases

Key-value fields:

  • Input: user query + top-K documents + LLM's prediction + 1st-tier error tag set + 2nd-tier error tag set
  • Output: judgement + selected 1st-tier error tags + selected 2nd-tier error tags
{
    "instruction": 
"You are a critical system designed to provide useful error type tags for retrieval-augmented generation (RAG) tasks. Your goal is to assist in detailed error analysis to improve the performance of AI assistants. Below are the [Question], the top-5 retrieved relevant [Passages], and the [Model's Prediction] for the RAG tasks.\n\n                Question: who wrote the song going to kansas city\n                Passage 1: \"Kansas City (Leiber and Stoller song)\"\nKansas City (Leiber and Stoller song) \"\"Kansas City\"\" is a rhythm and blues song written by Jerry Leiber and Mike Stoller in 1952. First recorded by Little Willie Littlefield the same year, the song later became a #1 hit when it was recorded by Wilbert Harrison in 1959. \"\"Kansas City\"\" became one of Leiber and Stoller's \"\"most recorded tunes, with more than three hundred versions,\"\" with several appearing in the R&B and pop record charts. \"\"Kansas City\"\" was written by Jerry Leiber and Mike Stoller, two nineteen-year-old rhythm and blues fans from Los Angeles, who had their first success writing\n                Passage 2: \"Kansas City (Leiber and Stoller song)\"\n\"\"Eighteenth and Vine\"\" for \"\"12th Street and Vine,\"\" which sings just as well, and recognizes Kansas City's jazz history. Kansas City (Leiber and Stoller song) \"\"Kansas City\"\" is a rhythm and blues song written by Jerry Leiber and Mike Stoller in 1952. First recorded by Little Willie Littlefield the same year, the song later became a #1 hit when it was recorded by Wilbert Harrison in 1959. \"\"Kansas City\"\" became one of Leiber and Stoller's \"\"most recorded tunes, with more than three hundred versions,\"\" with several appearing in the R&B and pop record charts. \"\"Kansas City\"\" was written by Jerry\n                Passage 3: \"Going to Kansas City\"\nGoing to Kansas City Going to Kansas City is a 1998 Canadian-Finnish drama film directed by Pekka Mandart. The film is about a male exchange student from Finland, who falls in love with an American girl, whose father does not accept the relationship. Shot in Canada, the film is set in the rural town of Canaan that is located 120 miles west of Kansas City. The director Pekka Mandart adapted the idea of the story from a magazine article about a love story between a Finnish exchange student and an American girl. Out of a total budget of 17 million\n                Passage 4: \"Wilbert Harrison\"\nWilbert Harrison Wilbert Huntington Harrison (January 5, 1929 – October 26, 1994) was an American rhythm and blues singer, pianist, guitarist and harmonica player. Born in Charlotte, North Carolina, Harrison had a Billboard #1 record in 1959 with the song \"\"Kansas City\"\". The song was written in 1952 and was one of the first credited collaborations by the team of Jerry Leiber and Mike Stoller. It sold over one million copies, and was awarded a gold disc. Harrison recorded \"\"Kansas City\"\" for the Harlem-based entrepreneur Bobby Robinson, who released it on his Fury record label. At the height of the\n                Passage 5: \"Wilbert Harrison\"\nthe North Carolina Music Hall of Fame in 2009. Wilbert Harrison Wilbert Huntington Harrison (January 5, 1929 – October 26, 1994) was an American rhythm and blues singer, pianist, guitarist and harmonica player. Born in Charlotte, North Carolina, Harrison had a Billboard #1 record in 1959 with the song \"\"Kansas City\"\". The song was written in 1952 and was one of the first credited collaborations by the team of Jerry Leiber and Mike Stoller. 
It sold over one million copies, and was awarded a gold disc. Harrison recorded \"\"Kansas City\"\" for the Harlem-based entrepreneur Bobby Robinson, who released it on\n                Model's Prediction: The song \"Kansas City\" was not written by the individuals or entities mentioned in the passages about \"Going to Kansas City,\" \"Wilbert Harrison,\" or \"Kansas City (Leiber and Stoller song).\" The song \"Kansas City\" was written by Jerry Leiber and Mike Stoller in 1952.\n\n                Please first determine whether the model's prediction is correct. If it is correct, output it as follows:\n                {\"Judgement\": \"Correct\"}\n\n                If it is incorrect, give these error types, tag1 corresponds to tag2 one-to-one:\n                        tag1= [\n                                \"Incomplete Information\",\n                                \"Incomplete Information\",\n                                \"Irrelevant Information\",\n                                \"Irrelevant Information\",\n                                \"Erroneous Information\",\n                                \"Incomplete or Missing Response\",\n                                \"Incomplete or Missing Response\",\n                                \"Incomplete or Missing Response\",\n                                \"Incomplete or Missing Response\",\n                                \"Incomplete or Missing Response\",\n                                \"Inaccurate or Misunderstood Response\",\n                                \"Inaccurate or Misunderstood Response\",\n                                \"Inaccurate or Misunderstood Response\",\n                                \"Inaccurate or Misunderstood Response\",\n                                \"Inaccurate or Misunderstood Response\",\n                                \"Inaccurate or Misunderstood Response\",\n                                \"Irrelevant or Off-Topic Response\",\n                                \"Irrelevant or Off-Topic Response\",\n                                \"Irrelevant or Off-Topic Response\",\n                                \"Overly Verbose Response\"\n                            ]\n\n                            tag2 = [\n                                \"Insufficient or Incomplete Information Retrieval\",\n                                \"Data Insufficiency in Retrieval\",\n                                \"Relevance Gaps in Retrieval\",\n                                \"Irrelevant Information Retrieval\",\n                                \"Erroneous Information Retrieval\",\n                                \"Omission of Key Information\",\n                                \"Lack of Specificity\",\n                                \"Specificity and Precision Errors\",\n                                \"Partial Coverage and Temporal Issues\",\n                                \"Lack of Practicality\",\n                                \"Contextual Understanding Errors\",\n                                \"Factual Inaccuracies\",\n                                \"Incorrect and Incomplete Answers\",\n                                \"Golden Answer Misalignment\",\n                                \"Misinterpretation of Queries and Information\",\n                                \"Entity and Concept Confusion\",\n                                \"Irrelevant Content and Topic Drift\",\n                                \"Off-Topic and Redundant Responses\",\n                                \"Content and Context 
Misalignment\",\n                                \"Overly Complex and Redundant Response\"\n                            ]\n                 Please identify the error tags at three levels, from coarse to fine, and provide a detailed error analysis. Adhere strictly to the following JSON format:\n                {\n                    \"Judgement\": \"Error\",\n                    \"Error_analysis\": \"\",\n                    \"tag1\": [],\n                    \"tag2\": [],\n                    \"tag3\": []\n                }\n                ",
    "input": "",
    "output": "{'Judgement': 'Error', 'Error_analysis': \"The model's prediction is incorrect because it does not directly answer the question about who wrote the song 'Going to Kansas City.' Instead, it provides information about the song 'Kansas City,' which is a different song. The passages provided do not contain information about the song 'Going to Kansas City,' and thus the model should have indicated that the information to answer the question is not available in the provided passages.\", 'tag1': ['Incomplete or Missing Response', 'Inaccurate or Misunderstood Response', 'Incomplete Information'], 'tag2': ['Entity and Concept Confusion', 'Lack of Specificity', 'Insufficient or Incomplete Information Retrieval', 'Contextual Understanding Errors'], 'tag3': ['Relevance Error', 'Contextual Understanding Error', 'Information Retrieval Failure', 'Specificity Error']}",
    "history": [

    ]
}

🔑 Inference

First, perform inference on the RAG-Error bench with the following commands:

cd ./rag_error_bench/

# Evaluate open-source LLMs
bash test_open_llm.sh

# Evaluate closed-source LLMs such as GPT-4o, DeepSeek R1, and Claude 3.5
bash test_close_llm.sh 

Each sample in RAG-Critic/rag_error_bench/test_data/baseline_test.json has the following form:

[Figure: example test sample from baseline_test.json]

📝 Evaluation

After completing the inference, run the evaluation script:

python ./rag_error_bench/caculate_acc.py

Note that you need to update the input and output paths in caculate_acc.py.
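
For intuition, the tag-level accuracy and F1 can be read as set overlap between predicted and gold tags, as in the sketch below; the authoritative implementation remains caculate_acc.py.

# tag-level precision/recall/F1 as set overlap; illustration only
def tag_precision_recall_f1(predicted: set, gold: set):
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)

pred = {"Incomplete Information", "Irrelevant Information"}
gold = {"Incomplete Information", "Erroneous Information"}
print(tag_precision_recall_f1(pred, gold))  # (0.5, 0.5, 0.5)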

🔍 Here, we provide detailed evaluation metric results of the RAG-Error bench in the following format.
{
  "overall": {
    "accuracy": 0.1194,  #Overall Acc.
    "f1": 0.1781,    #Overall F1
    "rouge": {
      "rouge-1": 0.4707,
      "rouge-2": 0.2361,
      "rouge-l": 0.4359
    },
    "judgement_accuracy": 0.6895,  #Overall Judgment
    "correct_judgement_accuracy": 0.9526,   #Overall judgement of correct prediction
    "tag1": {
      "accuracy": 0.1741,  #Overall Tag1 Acc.
      "f1": 0.2567,    #Overall Tag2 F1 Acc.
      "rouge": {
        "rouge-1": 0.4707,
        "rouge-2": 0.2361,
        "rouge-l": 0.4359
      },
      "judgement_accuracy": 0.4221   #Overall judgement of error prediction
    },
    "tag2": {
      "accuracy": 0.0647,   #Overall Tag2 Acc.
      "f1": 0.0995,    #Overall Tag2 F1 Acc.
      "rouge": {
        "rouge-1": 0.4707,
        "rouge-2": 0.2361,
        "rouge-l": 0.4359
      },
      "judgement_accuracy": 0.4221
    }
  },
  "category_metrics": {    #Coarse-grained Error Tags Acc.
    "tag1": {
      "Incomplete Information": {
        "accuracy": 0.2077,
        "f1": 0.2961,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Irrelevant Information": {
        "accuracy": 0.1289,
        "f1": 0.1968,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Erroneous Information": {
        "accuracy": 0.036,
        "f1": 0.0541,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Incomplete or Missing Response": {
        "accuracy": 0.1618,
        "f1": 0.2585,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Inaccurate or Misunderstood Response": {
        "accuracy": 0.273,
        "f1": 0.3803,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Irrelevant or Off-Topic Response": {
        "accuracy": 0.0103,
        "f1": 0.0188,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Overly Verbose Response": {
        "accuracy": 0.4259,
        "f1": 0.2771,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      }
    },
    "tag2": {
      "Insufficient or Incomplete Information Retrieval": {
        "accuracy": 0.2028,
        "f1": 0.2677,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Data Insufficiency in Retrieval": {
        "accuracy": 0.0053,
        "f1": 0.0106,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Relevance Gaps in Retrieval": {
        "accuracy": 0.2483,
        "f1": 0.2483,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Irrelevant Information Retrieval": {
        "accuracy": 0.0,
        "f1": 0,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Erroneous Information Retrieval": {
        "accuracy": 0.036,
        "f1": 0.0543,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Omission of Key Information": {
        "accuracy": 0.1565,
        "f1": 0.1513,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Lack of Specificity": {
        "accuracy": 0.0,
        "f1": 0,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Specificity and Precision Errors": {
        "accuracy": 0.0,
        "f1": 0,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Partial Coverage and Temporal Issues": {
        "accuracy": 0.0,
        "f1": 0,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Lack of Practicality": {
        "accuracy": 0.0,
        "f1": 0,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Contextual Understanding Errors": {
        "accuracy": 0.1971,
        "f1": 0.1843,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Factual Inaccuracies": {
        "accuracy": 0.0186,
        "f1": 0.0348,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Incorrect and Incomplete Answers": {
        "accuracy": 0.0073,
        "f1": 0.0143,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Misinterpretation of Queries and Information": {
        "accuracy": 0.0693,
        "f1": 0.0676,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Entity and Concept Confusion": {
        "accuracy": 0.0089,
        "f1": 0.0171,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Irrelevant Content and Topic Drift": {
        "accuracy": 0.0125,
        "f1": 0.0185,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Off-Topic and Redundant Responses": {
        "accuracy": 0.0,
        "f1": 0,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Content and Context Misalignment": {
        "accuracy": 0.0,
        "f1": 0,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      },
      "Overly Complex and Redundant Response": {
        "accuracy": 0.4259,
        "f1": 0.2788,
        "rouge": {
          "rouge-1": 0.4707,
          "rouge-2": 0.2361,
          "rouge-l": 0.4359
        }
      }
    }
  }
}

📜 License

Our dataset is distributed under the CC BY-NC 4.0 license.

🎖 Citation

Please cite our work if you find the repository helpful.

@inproceedings{dong2025ragcritic,
  author       = {Guanting Dong and
                  Jiajie Jin and
                  Xiaoxi Li and
                  Yutao Zhu and
                  Zhicheng Dou and
                  Ji{-}Rong Wen},
  editor       = {Wanxiang Che and
                  Joyce Nabende and
                  Ekaterina Shutova and
                  Mohammad Taher Pilehvar},
  title        = {RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for
                  Retrieval Augmented Generation},
  booktitle    = {Proceedings of the 63rd Annual Meeting of the Association for Computational
                  Linguistics (Volume 1: Long Papers), {ACL} 2025, Vienna, Austria,
                  July 27 - August 1, 2025},
  pages        = {3551--3578},
  publisher    = {Association for Computational Linguistics},
  year         = {2025},
  url          = {https://aclanthology.org/2025.acl-long.179/},
  timestamp    = {Thu, 24 Jul 2025 21:25:39 +0200},
  biburl       = {https://dblp.org/rec/conf/acl/DongJL0DW25.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

❤️ Acknowledgement

We thank the following repositories for their great work: FlashRAG, transformers, Llama-Factory and FollowRAG.
