We introduce RLPR (Reinforcement Learning with Reference Probability Reward), a framework that enhances the reasoning capabilities of Large Language Models (LLMs). RLPR uses the LLM's own generation probabilities as a reward signal, eliminating reliance on external verifiers. This approach enables robust, general-domain reasoning improvements with greater efficiency and broader applicability. Notable features of RLPR include:
💡 Stronger Reasoning Enhancement. RLPR achieves better reasoning capability enhancement on both mathematical and general-domain reasoning benchmarks, even surpassing strong methods that rely on verifier models.
🛠️ Simple and Scalable Reward. RLPR features an efficient Probability-based Reward (PR) that uses the average decoding probability of the reference answer. Without laborious rule-based verifier construction, rewards are computed with a single forward pass.
🚀 Better Reward Quality and Robust Training.
- PR exhibits better reward quality than rule-based rewards, model-based rewards, and naive likelihood as a reward.
- Applying RLPR with different training prompt templates yields robust reasoning capability enhancement.
We release the RLPR Train Dataset and evaluation benchmarks for easy use.
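The Probability-based Reward described above can be sketched in a few lines. The following is a minimal illustration, not code from this repository: it assumes you already have the per-token probabilities that the policy assigns to the reference answer under teacher forcing (obtainable in one forward pass); the function and variable names are hypothetical.

```python
# Minimal sketch of a Probability-based Reward (PR): the reward is the
# average per-token decoding probability the model assigns to the
# reference answer. Names are illustrative, not from the RLPR codebase.

def probability_reward(token_probs):
    """token_probs: probability the policy assigns to each token of the
    reference answer (teacher forcing, one forward pass).
    Returns the mean probability as the scalar reward."""
    if not token_probs:
        return 0.0
    return sum(token_probs) / len(token_probs)

# Example: probabilities for a 4-token reference answer given the
# question and the sampled reasoning chain.
reward = probability_reward([0.9, 0.8, 0.95, 0.75])
print(round(reward, 3))  # 0.85
```

Because the reward is a simple average rather than a binary verifier decision, it is dense and cheap to compute for any free-form answer.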
- Clone this repository and navigate to the RLPR folder

  ```shell
  git clone https://github.com/OpenBMB/RLPR.git
  cd RLPR
  ```
- Install packages

  ```shell
  bash scripts/setup_env.sh
  ```
- Prepare data

  Download the train and test datasets. Move `rlpr_train.parquet` to `./datasets/train`, and move all the test datasets to `./datasets/test`.

  ```shell
  huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Train-Dataset --local-dir ./datasets/train
  huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Evaluation --local-dir ./datasets/test
  ```
- Specify the base model path in `examples/RLPR/reproduce_<model>.sh`, where `<model>` can be `qwen`, `llama`, or `gemma`.

  ```shell
  MODEL=path_to_base_model
  ```
- (Optional) Log in to wandb and set `USE_WANDB` to `True` in `examples/RLPR/reproduce_<model>.sh` if you want to use wandb for logging.

  ```shell
  USE_WANDB=${USE_WANDB:-"false"}
  ```
- (Optional) Follow the steps below to use the LLM-as-a-judge eval method. Skip this step if you want to use a rule-based verifier to judge the answer.
  - Open-source model as judge
    - Create a new environment for the server and deploy the judge model (specify the judge model, host, and port in `setup_server.sh`):

      ```shell
      bash scripts/setup_server.sh
      ```

    - Specify the judge model in `examples/RLPR/reproduce_<model>.sh`:

      ```shell
      export CLIENT_IP=http://127.0.0.1:8001
      export USED_MODEL=Qwen/Qwen2.5-72B-Instruct
      ```

  - API-based model (gpt-4o / gpt-4.1) as judge
    - Specify the API token and the judge model in `examples/RLPR/reproduce_<model>.sh` to use the OpenAI API:

      ```shell
      export OPENAI_API_KEY=your_api_token
      export OPENAI_API_BASE=your_api_base  # default is https://api.openai.com/v1
      export USED_MODEL=gpt-4.1
      ```
- Run the training script

  ```shell
  bash examples/RLPR/reproduce_qwen.sh
  # bash examples/RLPR/reproduce_llama.sh
  # bash examples/RLPR/reproduce_gemma.sh
  ```
- Follow steps 1–4 in the Train section to prepare the data, model, and (optionally) judge model.
- Run the evaluation script

  ```shell
  bash examples/RLPR/reproduce_qwen.sh +trainer.val_only=True
  # bash examples/RLPR/reproduce_llama.sh +trainer.val_only=True
  # bash examples/RLPR/reproduce_gemma.sh +trainer.val_only=True
  ```
Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
- veRL: The codebase we built upon.
If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!
```bibtex
@misc{yu2025rlprextrapolatingrlvrgeneral,
      title={RLPR: Extrapolating RLVR to General Domains without Verifiers},
      author={Tianyu Yu and Bo Ji and Shouli Wang and Shu Yao and Zefan Wang and Ganqu Cui and Lifan Yuan and Ning Ding and Yuan Yao and Zhiyuan Liu and Maosong Sun and Tat-Seng Chua},
      year={2025},
      eprint={2506.18254},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.18254},
}
```