We introduce RLPR (Reinforcement Learning with Reference Probability Reward), a framework that enhances the reasoning capabilities of Large Language Models (LLMs). RLPR uses the LLM's own generation probabilities as a reward signal, eliminating reliance on external verifiers. This approach enables robust, general-domain reasoning improvements with greater efficiency and broader applicability. Notable features of RLPR include:
💡 Stronger Reasoning Enhancement. RLPR achieves better reasoning capability enhancement on both mathematical and general-domain reasoning benchmarks, even surpassing strong methods that rely on verifier models.
🛠️ Simple and Scalable Reward. RLPR features an efficient Probability-based Reward (PR) that uses the average decoding probability of the reference answer. Without laborious rule-based verifier construction, rewards are computed with a single forward pass.
🚀 Better Reward Quality and Robust Training.
- PR exhibits better reward quality than rule-based rewards, model-based rewards, and naive likelihood as a reward.
- Applying RLPR with different training prompt templates yields robust reasoning capability enhancement.
We release the RLPR Train Dataset and evaluation benchmarks for easy use.
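The Probability-based Reward described above can be sketched in a few lines. The following is a minimal illustration, not code from this repository: it assumes you already have the per-token probabilities that the policy assigns to the reference answer under teacher forcing (obtainable in one forward pass); the function and variable names are hypothetical.

```python
# Minimal sketch of a Probability-based Reward (PR): the reward is the
# average per-token decoding probability the model assigns to the
# reference answer. Names are illustrative, not from the RLPR codebase.

def probability_reward(token_probs):
    """token_probs: probability the policy assigns to each token of the
    reference answer (teacher forcing, one forward pass).
    Returns the mean probability as the scalar reward."""
    if not token_probs:
        return 0.0
    return sum(token_probs) / len(token_probs)

# Example: probabilities for a 4-token reference answer given the
# question and the sampled reasoning chain.
reward = probability_reward([0.9, 0.8, 0.95, 0.75])
print(round(reward, 3))  # 0.85
```

Because the reward is a simple average rather than a binary verifier decision, it is dense and cheap to compute for any free-form answer.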
- Clone this repository and navigate to the RLPR folder

  ```shell
  git clone https://github.com/OpenBMB/RLPR.git
  cd RLPR
  ```
- Install packages

  ```shell
  bash scripts/setup_env.sh
  ```
- Prepare data

  Download the train and test datasets. Move `rlpr_train.parquet` to `./datasets/train`, and move all the test datasets to `./datasets/test`.

  ```shell
  huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Train-Dataset --local-dir ./datasets/train
  huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Evaluation --local-dir ./datasets/test
  ```
- Specify the base model path in `examples/RLPR/reproduce_<model>.sh`, where `<model>` can be `qwen`, `llama`, or `gemma`.

  ```shell
  MODEL=path_to_base_model
  ```
- (Optional) Log in to wandb and set `USE_WANDB` to `True` in `examples/RLPR/reproduce_<model>.sh` if you want to use wandb for logging.

  ```shell
  USE_WANDB=${USE_WANDB:-"false"}
  ```
- (Optional) Follow the steps below to use the LLM-as-a-judge eval method. Skip this step if you want to use a rule-based verifier to judge the answer.
  - Open-source model as judge
    - Create a new environment for the server and deploy the judge model (specify the judge model, host, and port in `setup_server.sh`):

      ```shell
      bash scripts/setup_server.sh
      ```

    - Specify the judge model in `examples/RLPR/reproduce_<model>.sh`:

      ```shell
      export CLIENT_IP=http://127.0.0.1:8001
      export USED_MODEL=Qwen/Qwen2.5-72B-Instruct
      ```

  - API-based model (gpt-4o / gpt-4.1) as judge
    - Specify the API token and the judge model in `examples/RLPR/reproduce_<model>.sh` to use the OpenAI API:

      ```shell
      export OPENAI_API_KEY=your_api_token
      export OPENAI_API_BASE=your_api_base  # default is https://api.openai.com/v1
      export USED_MODEL=gpt-4.1
      ```
- Run the training script

  ```shell
  bash examples/RLPR/reproduce_qwen.sh
  # bash examples/RLPR/reproduce_llama.sh
  # bash examples/RLPR/reproduce_gemma.sh
  ```
- Follow steps 1–4 in the Train section to prepare the data, model, and (optionally) judge model.
- Run the evaluation script

  ```shell
  bash examples/RLPR/reproduce_qwen.sh +trainer.val_only=True
  # bash examples/RLPR/reproduce_llama.sh +trainer.val_only=True
  # bash examples/RLPR/reproduce_gemma.sh +trainer.val_only=True
  ```
Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
- veRL: The codebase we built upon.
If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!
```bibtex
@misc{yu2025rlprextrapolatingrlvrgeneral,
      title={RLPR: Extrapolating RLVR to General Domains without Verifiers},
      author={Tianyu Yu and Bo Ji and Shouli Wang and Shu Yao and Zefan Wang and Ganqu Cui and Lifan Yuan and Ning Ding and Yuan Yao and Zhiyuan Liu and Maosong Sun and Tat-Seng Chua},
      year={2025},
      eprint={2506.18254},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.18254},
}
```