This repository presents BPO, a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: a balanced reward margin and a gap adaptor. Unlike previous methods, BPO fundamentally resolves DPO's DCR issue without introducing additional constraints into the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% → 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% → 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
Overall performance across five competition-level math reasoning benchmarks. Avg. denotes the mean accuracy across all five datasets. The table shows that BPO outperforms both standard DPO and its variants, achieving the highest average accuracy.
Method | AIME2024 | MATH500 | AMC2023 | Minerva Math | Olympiad Bench | Avg.
---|---|---|---|---|---|---
GPT-4o | 9.3 | 76.4 | 45.8 | 36.8 | 43.3 | 43.3 |
Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
Qwen2.5-Math-7B-Base | 23.3 | 66.4 | 47.5 | 13.2 | 24.4 | 35.0 |
Qwen2.5-Math-7B-Base-SFT | 20.0 | 73.2 | 62.5 | 30.5 | 35.6 | 44.4 |
Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
Qwen2.5-7B-RAFT-Zero | 20.0 | 77.6 | 55.0 | 30.5 | 38.7 | 44.4 |
DPO | 6.7 | 71.2 | 55.0 | 39.3 | 32.9 | 41.0 |
IPO | 10.0 | 75.6 | 52.5 | 39.7 | 37.6 | 43.1 |
SLiC | 10.0 | 73.2 | 55.0 | 37.5 | 33.0 | 41.7 |
Cal-DPO | 20.0 | 75.4 | 62.5 | 24.3 | 35.9 | 43.6 |
DPOP | 23.3 | 77.0 | 57.5 | 30.9 | 35.9 | 44.9 |
BPO (ours) | 30.0 | 75.8 | 60.0 | 31.2 | 36.3 | 46.7 |
Performance comparison across different model architectures and scales, showing that BPO consistently outperforms DPO across all configurations and datasets.
Base Model | Method | AIME2024 | MATH500 | AMC2023 | Minerva Math | Olympiad Bench | Avg. |
---|---|---|---|---|---|---|---|
Llama-3.1-8B-Instruct | DPO | 3.3 | 44.6 | 12.5 | 22.1 | 11.6 | 18.8 |
 | BPO | 10.0 | 50.6 | 40.0 | 27.2 | 16.7 | 28.9 |
Qwen2.5-Math-1.5B-Base | DPO | 3.3 | 58.8 | 27.5 | 27.6 | 23.6 | 28.2 |
 | BPO | 16.7 | 64.8 | 52.5 | 26.8 | 30.5 | 38.3 |
Qwen2.5-Math-7B-Base | DPO | 6.7 | 71.2 | 55.0 | 39.3 | 32.9 | 41.0 |
 | BPO | 30.0 | 75.8 | 60.0 | 31.2 | 36.3 | 46.7 |
Qwen2.5-Math-7B-Instruct | DPO | 10.0 | 77.0 | 60.0 | 28.7 | 38.1 | 42.8 |
 | BPO | 20.0 | 82.4 | 60.0 | 40.8 | 40.6 | 48.8 |
Performance comparison between the relative reward margin x_1 - x_2 and the balanced reward margin min(x_1, -x_2) under different loss types. Δ (gain) denotes the accuracy improvement from switching to the balanced margin.
Loss Type | Algorithm | x_1 - x_2 | min(x_1, -x_2) | Δ (gain) |
---|---|---|---|---|
Logistic log loss | DPO | 41.0 | 44.5 | + 3.5 |
Hinge loss | SLiC | 41.7 | 46.7 | + 5.0 |
Squared loss | IPO | 43.1 | 43.9 | + 0.8 |
Exponential loss | N/A | 43.5 | 43.9 | + 0.4 |
Truncated quadratic loss | N/A | 42.4 | 44.3 | + 1.9 |
Savage loss | N/A | 42.7 | 43.7 | + 1.0 |
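The loss types in the table above are standard margin-based surrogate losses from binary classification, each applied here to a preference margin m. The sketch below writes out one common form of each; the scale and target constants (e.g. the target margin of 1 in the squared loss) are illustrative assumptions, not the exact hyperparameters used in the experiments.

```python
import math

def surrogate_losses(m):
    """Common forms of the margin-based losses listed in the table,
    evaluated at margin m (scale constants chosen for illustration)."""
    return {
        "logistic": math.log1p(math.exp(-m)),          # DPO-style log loss
        "hinge": max(0.0, 1.0 - m),                    # SLiC-style
        "squared": (m - 1.0) ** 2,                     # IPO-style, target margin 1
        "exponential": math.exp(-m),
        "truncated_quadratic": max(0.0, 1.0 - m) ** 2,
        "savage": 1.0 / (1.0 + math.exp(m)) ** 2,
    }
```

All six decrease (or stay flat) as the margin m grows, which is what lets the balanced margin min(x_1, -x_2) be dropped into any of them in place of x_1 - x_2.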
Install the dependencies:

```bash
pip install -r requirements.txt
pip install trl math-verify
```
Execute the following command to train the model:

```bash
CUDA_VISIBLE_DEVICES=0 python train_dpo.py --config train_config.yaml
```
Execute the following command to evaluate the trained model:

```bash
bash math_evaluation/sh/evaluate_all_bench.sh
```
[2025/05/30] We're releasing the following components:
- 🚀 Training: Complete implementation of our training pipeline.
- 🔥 BPO Dataset: Our curated dataset of 28k mathematical questions (to be updated).
- 🤖 BPO Model: Model trained with BPO (to be updated).
Our work builds upon the insightful technical reports of Cal-DPO and DPOP. We thank the Qwen-Math team for their open-source models, and the creators of TRL and vLLM for providing the reinforcement learning framework and inference infrastructure, respectively, that enabled this research.
If you find this work useful, please cite our paper:
```bibtex
@misc{sun2025bpo,
      title={BPO: Revisiting Preference Modeling in Direct Preference Optimization},
      author={Lin Sun and Chuang Liu and Peng Liu and Bingyang Li and Weijia Lu and Ning Wu},
      year={2025},
      eprint={2506.03557},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.03557},
}
```