
RLPR: Extrapolating RLVR to General Domains



📜 Brief Introduction

We introduce RLPR (Reinforcement Learning with Reference Probability Reward), a framework that enhances the reasoning capabilities of Large Language Models (LLMs). RLPR uses the LLM's own generation probabilities as the reward signal, eliminating reliance on external verifiers. This enables robust, general-domain reasoning improvements with greater efficiency and broader applicability. Notable features of RLPR include:

💡 Stronger Reasoning Enhancement. RLPR achieves better reasoning enhancement on both mathematical and general-domain reasoning benchmarks, even surpassing strong methods that use verifier models.

🛠️ Simple and Scalable Reward. RLPR features an efficient Probability-based Reward (PR) that uses the average decoding probability of the reference answer. Instead of laboriously constructing rule-based verifiers, we simply compute rewards with a single forward pass (see the sketch after this list).

🚀 Better Reward Quality and Robust Training.

PR exhibits better reward quality than rule-based rewards, model-based rewards, and naive likelihood as a reward.

We apply RLPR with different training prompt templates and find that it achieves robust reasoning enhancement across all of them.
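To make PR concrete, here is a minimal standalone sketch of the idea, assuming a Hugging Face causal LM: score a reference answer by the mean probability the model assigns to its tokens given the prompt. The model choice and the function name are illustrative; the actual reward computation lives in the veRL-based training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; RLPR computes PR with the policy model itself.
model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def probability_reward(prompt: str, reference: str) -> float:
    """Mean per-token probability of `reference` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    ref_ids = tokenizer(reference, add_special_tokens=False,
                        return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, ref_ids], dim=-1)
    logits = model(input_ids).logits  # a single forward pass
    # Logits at position i predict token i+1, so the reference tokens are
    # predicted by positions prompt_len-1 .. seq_len-2.
    ref_logits = logits[0, prompt_ids.shape[-1] - 1 : -1, :]
    probs = torch.softmax(ref_logits.float(), dim=-1)
    token_probs = probs.gather(-1, ref_ids[0].unsqueeze(-1)).squeeze(-1)
    # Averaging token probabilities (rather than multiplying them into a
    # sequence likelihood) keeps the reward on a stable scale regardless
    # of answer length.
    return token_probs.mean().item()

print(probability_reward("Q: What is the capital of France? A:", " Paris"))
```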


Dataset

We release the RLPR Train Dataset and evaluation benchmarks for easy use.
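Once downloaded (Train step 1 below places the files under ./datasets), you can take a quick look at the training data with pandas; this snippet is illustrative and makes no assumption about the column schema:

```python
import pandas as pd

# Path matches the layout used in the Train section below.
df = pd.read_parquet("./datasets/train/rlpr_train.parquet")
print(df.shape)             # number of training examples and columns
print(df.columns.tolist())  # inspect the schema rather than assuming it
print(df.head(2))
```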

Install

  1. Clone this repository and navigate to the RLPR folder
git clone https://github.com/OpenBMB/RLPR.git
cd RLPR
  2. Install the package
bash scripts/setup_env.sh

Train

  1. Prepare data

Download the training and test datasets. Place rlpr_train.parquet in ./datasets/train and all test datasets in ./datasets/test.

huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Train-Dataset --local-dir ./datasets/train
huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Evaluation --local-dir ./datasets/test
  2. Specify the base model path in examples/RLPR/reproduce_<model>.sh, where <model> can be qwen, llama, or gemma.
MODEL=path_to_base_model
  3. (Optional) Log in to wandb and set USE_WANDB to True in examples/RLPR/reproduce_<model>.sh if you want to use wandb for logging.
USE_WANDB=${USE_WANDB:-"false"}
  4. (Optional) Follow the steps below to use LLM-as-judge evaluation. Skip this step if you want to use the rule-based verifier to judge answers. (A connectivity check for the judge server is sketched after this list.)

    • Open-Source Model as judge

      1. Create a new environment for the server and deploy the model (specify the judge model, host, and port in setup_server.sh).

        bash scripts/setup_server.sh
      2. Specify the judge model in examples/RLPR/reproduce_<model>.sh.

        export CLIENT_IP=http://127.0.0.1:8001
        export USED_MODEL=Qwen/Qwen2.5-72B-Instruct
    • API-Based Model (gpt-4o / gpt-4.1) as judge

      Specify your API token and the judge model in examples/RLPR/reproduce_<model>.sh to use the OpenAI API.

      export OPENAI_API_KEY=your_api_token
      export OPENAI_API_BASE=your_api_base  # default is https://api.openai.com/v1
      export USED_MODEL=gpt-4.1
  5. Run the training script

bash examples/RLPR/reproduce_qwen.sh
# bash examples/RLPR/reproduce_llama.sh
# bash examples/RLPR/reproduce_gemma.sh
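If you deployed an open-source judge in step 4, it can help to confirm the server responds before launching a long run. The sketch below assumes setup_server.sh exposes an OpenAI-compatible endpoint at CLIENT_IP (as the CLIENT_IP / USED_MODEL variables suggest); it is not part of the repository.

```python
from openai import OpenAI

# Assumed OpenAI-compatible judge endpoint; append /v1 to CLIENT_IP.
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",  # must match USED_MODEL
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)  # expect a short acknowledgement
```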

Evaluation

  1. Follow steps 1–4 in the Train section to prepare the data, the base model, and (optionally) the judge model.

  2. Run the evaluation script (the +trainer.val_only=True override skips training and runs validation only)

bash examples/RLPR/reproduce_qwen.sh +trainer.val_only=True
# bash examples/RLPR/reproduce_llama.sh +trainer.val_only=True
# bash examples/RLPR/reproduce_gemma.sh +trainer.val_only=True

Licenses


Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset must not be used outside of research purposes.

Acknowledgement

  • veRL: The codebase we built upon.

Citation

If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!

@misc{yu2025rlprextrapolatingrlvrgeneral,
      title={RLPR: Extrapolating RLVR to General Domains without Verifiers}, 
      author={Tianyu Yu and Bo Ji and Shouli Wang and Shu Yao and Zefan Wang and Ganqu Cui and Lifan Yuan and Ning Ding and Yuan Yao and Zhiyuan Liu and Maosong Sun and Tat-Seng Chua},
      year={2025},
      eprint={2506.18254},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.18254}, 
}
