
VerIF: Verification Engineering for RL in Instruction Following


Introduction

VerIF is a practical and efficient method for verification in instruction-following reinforcement learning. Built on the idea of Reinforcement Learning with Verifiable Rewards (RLVR), VerIF integrates rule-based code checks with LLM-based reasoning verification (e.g., QwQ-32B) to provide accurate and scalable reward signals.

To support this method, we construct a high-quality dataset, VerInstruct, with ~22,000 instruction-following instances paired with verification signals. Models trained with VerIF not only achieve state-of-the-art instruction-following performance on several benchmarks among models of similar scale but also maintain their general capabilities.

🔥 Results

Result Chart

RL with VerIF significantly improves instruction-following performance across benchmarks.

Method

Method Figure

VerIF combines rule-based code checks for hard constraints with LLM-based reasoning verification (e.g., QwQ-32B) for soft constraints, yielding accurate and scalable reward signals.
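
As a conceptual illustration only (not the repository's actual reward code), the Python sketch below shows how a code-based hard-constraint check and an LLM-based soft-constraint verdict might be combined into a single reward. The function names, the example rule, and the product aggregation are all hypothetical choices for this sketch.

    import re

    def rule_check(response: str) -> float:
        """Hard-constraint check via code: here, a hypothetical rule that the
        response must contain at least three bullet points."""
        bullets = re.findall(r"^\s*[-*] ", response, flags=re.MULTILINE)
        return 1.0 if len(bullets) >= 3 else 0.0

    def llm_verify(instruction: str, response: str) -> float:
        """Soft-constraint check via an LLM verifier (e.g., QwQ-32B).
        Stubbed here; in practice this is an API call to the deployed
        verifier (see Step 2 of the training guide)."""
        return 1.0  # placeholder verdict

    def verif_reward(instruction: str, response: str) -> float:
        """Combine both signals into one reward. The product (both checks must
        pass) is an illustrative aggregation, not necessarily the exact rule
        used in VerIF."""
        return rule_check(response) * llm_verify(instruction, response)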


Data & Trained Models


Training Guide

This repo is forked from verl. We sincerely thank the authors for their excellent framework. We introduce two key adjustments:

  1. Efficient Local Reward Server:
    We provide a local_server version of the reward function for better efficiency. We recommend running it inside a sandboxed Docker environment to avoid potential security issues. You may also deploy your own remote server.

  2. Batch Reward Collection:
    We modified ./verl/workers/reward_manager/naive.py to support batched reward calculation, which is more efficient than the original loop-based implementation (see the sketch below). No other parts of the repo are modified.
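
The sketch below illustrates the idea behind batched reward collection, assuming per-sample scoring is dominated by slow, network-bound verifier calls; it is not the actual code in ./verl/workers/reward_manager/naive.py, and the function names are placeholders.

    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable, List, Tuple

    def compute_rewards_looped(pairs: List[Tuple[str, str]],
                               score_fn: Callable[[str, str], float]) -> List[float]:
        # Original style: one blocking reward call per (prompt, response) pair.
        return [score_fn(prompt, response) for prompt, response in pairs]

    def compute_rewards_batched(pairs: List[Tuple[str, str]],
                                score_fn: Callable[[str, str], float],
                                max_workers: int = 16) -> List[float]:
        # Batched style: dispatch the whole batch at once so that slow,
        # network-bound verifier calls overlap instead of running serially.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda p: score_fn(*p), pairs))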


Quick Start (RL Using VerIF)

Please refer to the original verl documentation for environment setup.

Step 1: Preprocess Data

Download data from here. Use ./examples/data_preprocess/if_prompts.py to preprocess VerInstruct.

Make sure to add the import path for ./verl/utils/reward_score/local_server at the top of each function.
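
One way to do this (a minimal sketch assuming the code is run from the repository root; adjust the relative path to your checkout):

    import sys

    # Make the local reward server utilities importable before they are used;
    # the path is relative to the repository root.
    sys.path.append("./verl/utils/reward_score/local_server")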

Step 2: Setup the Verifier Model

For soft constraint verification, use an LLM-based verifier. You may:

  • Use our trained verifier, which is based on R1-Distilled-Qwen-7B
  • Use QwQ-32B as the verifier

We suggest using SGLang or vLLM for deployment.
Then modify ./verl/utils/reward_score/local_server/llm_call.py with your API endpoint and model name.
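
For reference, here is a minimal sketch of such a call, assuming the verifier is served behind an OpenAI-compatible endpoint (which both SGLang and vLLM provide). The base URL, model name, and prompt wording are placeholders rather than the actual contents of llm_call.py.

    from openai import OpenAI

    # Placeholder endpoint and model name; point these at your own deployment.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    def llm_verify(instruction: str, response: str) -> str:
        """Ask the deployed verifier (e.g., QwQ-32B) whether the response
        satisfies the soft constraints of the instruction."""
        completion = client.chat.completions.create(
            model="QwQ-32B",  # replace with the name your server registers
            messages=[{
                "role": "user",
                "content": (
                    f"Instruction:\n{instruction}\n\nResponse:\n{response}\n\n"
                    "Does the response satisfy all constraints in the instruction? "
                    "Think step by step, then answer YES or NO."
                ),
            }],
            temperature=0.0,
        )
        return completion.choices[0].message.content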

Step 3: Start Training

Use the provided training scripts:

  • ./examples/grpo_trainer/run_qwen2-7b_verif.sh
  • ./examples/grpo_trainer/run_tulu3-8b_verif.sh

These use DeepSeek-R1-Distilled-Qwen-7B and TULU 3 SFT as base models.
Update paths to point to your model checkpoint if needed.


Acknowledgments

We thank the verl team for their open-source framework, and the Crab team for open-sourcing the original data.

Citations

If this repo helps your work, please cite us:

@misc{peng2025verif,
      title={VerIF: Verification Engineering for Reinforcement Learning in Instruction Following}, 
      author={Hao Peng and Yunjia Qi and Xiaozhi Wang and Bin Xu and Lei Hou and Juanzi Li},
      year={2025},
      eprint={2506.09942},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.09942}, 
}

About

[EMNLP 2025] Verification Engineering for RL in Instruction Following
