
AgentIF


We introduce AgentIF, the first benchmark for systematically evaluating the instruction-following ability of LLMs in agentic scenarios. AgentIF has three key characteristics: (1) Realistic: constructed from 50 real-world agentic applications. (2) Long: instructions average 1,723 words, with a maximum of 15,630 words. (3) Complex: instructions average 11.9 constraints, covering diverse constraint types such as tool specifications and condition constraints.

The figures below show the instruction length distribution in AgentIF together with the success rates of several representative LLMs across the constraint dimensions we propose, followed by an example instruction from AgentIF.

[Figure: instruction length distribution and success rates across constraint dimensions]

[Figure: an example instruction from AgentIF]

Leaderboard

Metrics

  • Constraint Success Rate (CSR) measures the proportion of individual constraints that are correctly satisfied by the model’s response.
  • Instruction Success Rate (ISR) measures the proportion of instructions for which all constraints are satisfied.
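
For concreteness, here is a minimal sketch of how these two metrics can be computed from per-constraint judgments; the `results` structure (one list of booleans per instruction) is a hypothetical format, not the repository's actual output schema.

    # Sketch of CSR / ISR computation; `results` is a hypothetical format
    # where results[i][j] is True iff constraint j of instruction i is satisfied.
    def compute_metrics(results: list[list[bool]]) -> tuple[float, float]:
        total = sum(len(r) for r in results)
        satisfied = sum(sum(r) for r in results)
        csr = satisfied / total                            # Constraint Success Rate
        isr = sum(all(r) for r in results) / len(results)  # Instruction Success Rate
        return csr, isr

    # Example: two instructions with 3 and 2 constraints, respectively.
    csr, isr = compute_metrics([[True, True, False], [True, True]])
    print(f"CSR = {csr:.2f}, ISR = {isr:.2f}")  # CSR = 0.80, ISR = 0.50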

Performance Across Constraint Categories

[Figure: model performance across constraint categories]

Evaluation

For each instruction, we annotate the associated constraints and the evaluation method for each constraint: code-based evaluation, LLM-based evaluation, or hybrid code-LLM evaluation.
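
As a rough illustration of how these three evaluation methods might be dispatched per constraint, consider the sketch below; the field names (`eval_type`, `checker_code`) and the `llm_judge` callable are hypothetical and do not reflect the repository's actual interfaces.

    # Illustrative dispatch over the three evaluation methods (sketch only;
    # field names and helpers are hypothetical, not the repository's schema).
    def evaluate_constraint(constraint: dict, response: str, llm_judge) -> bool:
        eval_type = constraint["eval_type"]  # hypothetical field: "code", "llm", or "hybrid"
        if eval_type == "code":
            # Code-based: run an annotated checker function against the response.
            namespace: dict = {}
            exec(constraint["checker_code"], namespace)  # hypothetical field holding a `check` function
            return bool(namespace["check"](response))
        if eval_type == "llm":
            # LLM-based: ask a judge model whether the constraint is satisfied.
            return llm_judge(constraint["description"], response)
        # Hybrid: code extracts the relevant span, then the LLM judges it.
        namespace = {}
        exec(constraint["checker_code"], namespace)  # hypothetical `extract` helper
        return llm_judge(constraint["description"], namespace["extract"](response))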

How to evaluate

  1. Clone the repository to your local environment. The benchmark data is included, so no separate download is needed.

    git clone https://github.com/THU-KEG/AgentIF.git
    
  2. (Optional) To evaluate a locally hosted model, deploy it with vLLM using a command similar to the following (a quick Python check of the endpoint is sketched after this list):

    CUDA_VISIBLE_DEVICES=<CUDA_ID> vllm serve "<your_model_path>" \
        --served-model-name <your_model_name> \
        --port 8008 \
        --tensor-parallel-size <num_gpus> \
        --max-model-len 32000 \
        --gpu-memory-utilization 0.9
  3. Specify the target model and the evaluator in the run.sh file. To reproduce our results, use gpt-4o-2024-11-20 as the evaluator.

    Model_Name=""             # Name of the model to evaluate
    Model_Name_URL=""         # Endpoint of the model (e.g., OpenAI API URL or local vLLM URL)
    Model_Name_API_Key="EMPTY" # Set to "EMPTY" for local vLLM; otherwise, provide your API key
    
    Evaluator_Model_Backbone=""  # Name of the evaluator model; use `gpt-4o-2024-11-20` for reproducibility
    Evaluator_URL=""             # Base URL of the evaluator; use `https://api.openai.com/v1` to match our setup
    Evaluator_API_Key=""         # API key for the evaluator
    
  4. Then run the script to start the evaluation.

    sh run.sh
    
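Before launching run.sh, it can be useful to confirm that the model endpoint is reachable. Below is a small, optional check against the OpenAI-compatible API that vLLM exposes; the model name, port, and prompt are placeholders matching the example deployment in step 2.

    # Optional sanity check of the endpoint configured in run.sh
    # (requires `pip install openai`; model name and URL are placeholders).
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8008/v1",  # matches the example vLLM port above
        api_key="EMPTY",                      # local vLLM ignores the key; use a real key for hosted APIs
    )

    completion = client.chat.completions.create(
        model="<your_model_name>",            # the --served-model-name passed to vLLM
        messages=[{"role": "user", "content": "Reply with the single word: ready"}],
        max_tokens=8,
    )
    print(completion.choices[0].message.content)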

Citation

@misc{qi2025agentifbenchmarkinginstructionfollowing,
      title={AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios}, 
      author={Yunjia Qi and Hao Peng and Xiaozhi Wang and Amy Xin and Youfeng Liu and Bin Xu and Lei Hou and Juanzi Li},
      year={2025},
      eprint={2505.16944},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.16944}, 
}
