
🚀 Official Implementation for ICLR'2025 Paper

Paper: Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives

    🌍 [Website]     |     📖 [Preprint]     |     📚 [Embeddings (To be released soon)]     |     ⚙️ [Infrastructure]


We have a series of works focusing on embedding-based reward models in RLHF:

  • Part I. Reward Model Foundation (This paper: foundation of preference-based reward modeling and embedding-based reward models)
  • Part II. Active Reward Modeling ([preprint], [repo])
  • Part III. Accelerating Reward Model Research with our Infra. ([preprint], [repo])
  • Part IV. Human Preference Learning through Principal Component Analysis ([preprint])

⚙️ Infra for Easy-Reproducible Reward Model Research

Reproducing reward modeling research has long been a challenge, given its heavy hardware requirements and the cost of training, evaluation, and inference. We propose conducting easily reproducible reward model research in the embedding space.

The details of the workflow are described in [Part III. TO BE RELEASED SOON.]. Our motivation is to make it possible for every researcher with a single CPU to conduct reward modeling (and RLHF) research.

🔁 Reproducing the Results without GPUs

Part 1: Reproducing the Data: this part is optional and computationally expensive.

  • Step 1 (optional, GPU required): SFT. You need to update the PATH to the models/open-sourced datasets, and you may need to apply for licences to use those models/datasets first.
python3 step1_sft.py --model_name gemma2b --dataset hh-rlhf-helpful-gpt4
  • Step 2 (optional, GPU required): Generate samples for the training prompts (10 per prompt) and the testing prompts (500 per prompt)
python3 step2_gen_sample.py --model_name gemma2b --adapter_name sft --dataset hh-rlhf-helpful-gpt4 --eval_dataset hh-rlhf-helpful --data_class train --n_samples 10 --max_len 128
python3 step2_gen_sample.py --model_name gemma2b --adapter_name sft --dataset hh-rlhf-helpful-gpt4 --eval_dataset hh-rlhf-helpful --data_class test --n_samples 500 --max_len 128
  • Step 3 (optional, GPU required): Annotate response quality using golden reward models
python3 step3_reward_annotation.py --adapter_name sft --model_name gemma2b --dataset hh-rlhf-helpful-gpt4 --data_class train --n_samples 10
python3 step3.5_processing_data.py # pre-process and re-organize the dataset
  • Step 4 (optional, GPU required): Generate and store embeddings of all prompt-response pairs
python3 step4_gen_embeddings.py --model_name gemma2b --dataset hh-rlhf-helpful-gpt4 --gen_pref_model_name gemma2b --train_test train --n_samples 10

The above four steps produce an embedding-based dataset; with such a dataset, reward modeling research can be reproduced easily. To illustrate, an example sketch follows.
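
The sketch below is not the repository's example code; the file names embeddings_train.npy and golden_rewards_train.npy are hypothetical placeholders for the artifacts produced in Steps 1-4.

# Minimal sketch (hypothetical file names): building preference pairs
# from cached embeddings and golden-reward annotations.
import numpy as np

embeddings = np.load("embeddings_train.npy")   # shape: (n_prompts, n_samples, embed_dim)
rewards = np.load("golden_rewards_train.npy")  # shape: (n_prompts, n_samples)

rng = np.random.default_rng(0)
chosen, rejected = [], []
for emb, rew in zip(embeddings, rewards):
    i, j = rng.choice(len(rew), size=2, replace=False)  # pick 2 of the generated responses
    if rew[i] == rew[j]:
        continue  # skip ties
    better, worse = (i, j) if rew[i] > rew[j] else (j, i)
    chosen.append(emb[better])
    rejected.append(emb[worse])

x_chosen, x_rejected = np.stack(chosen), np.stack(rejected)
print(x_chosen.shape, x_rejected.shape)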

Part 2: Reproducing Results: this part is computationally efficient, and the reproduction can be done on a CPU-only machine.

  • Step 5 (reproduction: reward model training with CPUs)
python3 step5_train_rms.py --embed_model_name gemma2b --task helpful --sft_obj gpt4 --gen_pref_model_name gemma2b --rm_objective clf --consider_first_n 2 --annotation_quality 10
# task: helpful / harmless: to specify the task
# sft_obj: gpt4 / none: to load data generated by different model checkpoints (fine-tuned or non-fine-tuned models)
# gen_pref_model_name: gemma2b / gemma7b / llama38b: to experiment with different models
# rm_objective: clf / bt: for classification models and Bradley-Terry models (see the sketch after Step 6)
# consider_first_n: 2 / -1 / -2: to specify the comparison format; 2 randomly selects 2 of the 10 generated responses; -2 uses comparisons that lack diversity; -1 uses comparisons with high diversity
# annotation_quality: 10 gives high-quality annotations (error rate below 5 percent); 1.0, 0.5, and 0.1 give lower-quality annotations with error rates of up to 40 percent
  • Step 6 (reproduction: reward model evaluation with CPUs)
python3 step6_eval_rms.py --embed_model_name gemma2b --task helpful --sft_obj gpt4 --gen_pref_model_name gemma2b --rm_objective clf --consider_first_n 2 --annotation_quality 10
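
For reference, the two --rm_objective options correspond to two different losses over the same embeddings. The sketch below is our own illustration on random tensors with a placeholder embed_dim and a single linear reward head; it is not the repository's step5 training code.

# Conceptual contrast of the two reward-model objectives on embeddings
# (illustration only; random data, placeholder dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 2048                       # placeholder embedding dimension
reward_head = nn.Linear(embed_dim, 1)  # light reward head trained on frozen embeddings

x_chosen = torch.randn(64, embed_dim)    # embeddings of preferred responses
x_rejected = torch.randn(64, embed_dim)  # embeddings of rejected responses

# Bradley-Terry objective (rm_objective=bt): fit pairwise comparisons by
# maximising log sigmoid(r(chosen) - r(rejected)).
r_chosen = reward_head(x_chosen).squeeze(-1)
r_rejected = reward_head(x_rejected).squeeze(-1)
bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# Classification objective (rm_objective=clf): treat each response as a
# binary example (preferred = 1, rejected = 0) and fit per-response labels.
x_all = torch.cat([x_chosen, x_rejected])
y_all = torch.cat([torch.ones(64), torch.zeros(64)])
clf_loss = F.binary_cross_entropy_with_logits(reward_head(x_all).squeeze(-1), y_all)

print(bt_loss.item(), clf_loss.item())  # either loss is cheap enough to optimise on a CPU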

Call for Contribution to the Infra (an Embedding-based Dataset for Reward Modeling Research)

Call for contributors! --- Please contact me at sunhopht@gmail.com if you are interested in contributing the embeddings / golden-reward annotations from your reward model research to the open-source RM community!

📚 BibTex Citation

If you would like to cite our code or paper, please use

@inproceedings{
  sun2025rethinking,
  title={Rethinking Reward Modeling in Preference-based Large Language Model Alignment},
  author={Hao Sun and Yunyi Shen and Jean-Francois Ton},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=rfdblE10qm}
}

