This repository contains resources referenced in the paper Reinforcement Learning Enhanced LLMs: A Survey.
If you find this repository helpful, please cite the following:
@misc{wang2024reinforcementlearningenhancedllms,
title={Reinforcement Learning Enhanced LLMs: A Survey},
author={Shuhe Wang and Shengyu Zhang and Jie Zhang and Runyi Hu and Xiaoya Li and Tianwei Zhang and Jiwei Li and Fei Wu and Guoyin Wang and Eduard Hovy},
year={2024},
eprint={2412.10400},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.10400},
}
Stay tuned! More related work will be added!
- [17 Dec, 2024] The repository is created.
- [5 Dec, 2024] We release the first version of the paper.
The paper surveys research in the rapidly growing field of enhancing large language models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve their performance by receiving reward feedback based on the quality of their outputs, allowing them to generate more accurate, coherent, and contextually appropriate responses. In this work, we provide a systematic review of the most up-to-date state of knowledge on RL-enhanced LLMs, consolidating and analyzing the rapidly growing research in this field to help researchers understand the current challenges and advancements.
Specifically, we (1) detail the basics of RL; (2) introduce popular RL-enhanced LLMs; (3) review research on two widely used reward-model-based RL techniques: Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF); and (4) explore Direct Preference Optimization (DPO), a family of methods that bypass the reward model and directly use human preference data to align LLM outputs with human expectations. We also point out current challenges and deficiencies of existing methods and suggest avenues for further improvement.
The typology of the paper is as follows:
RL Enhanced LLMs | Organization | # Params | Project | Paper | Open Source |
---|---|---|---|---|---|
Instruct-GPT [1] | OpenAI | 1.3B, 6B, 175B | project | paper | No |
GPT-4 [2] | OpenAI | - | - | paper | No |
Gemini [3] | Google | - | - | paper | No
InternLM2 [4] | Shanghai AI Laboratory | 1.8B, 7B, 20B | project | paper | Yes |
Claude 3 [5] | Anthropic | - | project | - | No |
Reka [6] | Reka | 7B, 21B | project | paper | No |
Zephyr [7] | Argilla | 141B-A39B | project | - | Yes |
Phi-3 [8] | Microsoft | 3.8B, 7B, 14B | project | paper | Yes |
DeepSeek-V2 [9] | DeepSeek-AI | 236B-A21B | project | paper | Yes |
ChatGLM [10] | Team GLM | 6B, 9B | project | paper | Yes |
Nemotron-4 340B [11] | NVIDIA | 340B | project | paper | Yes |
Llama 3 [12] | Meta | 8B, 70B, 405B | project | paper | Yes |
Qwen2 [13] | Qwen Team, Alibaba Group | (0.5-72)B, 57B-A14B | project | paper | Yes |
Gemma2 [14] | Google | 2B, 9B, 27B | project | paper | Yes
Starling-7B [15] | Berkeley | 7B | project | paper | Yes |
Athene-70B [16] | Nexusflow | 70B | project | paper | Yes |
Hermes 3 [17] | Nous Research | 8B, 70B, 405B | project | paper | Yes |
o1 [18] | OpenAI | - | project | - | No |
Kimi-k1.5 [19] | Moonshot AI | - | - | paper | No |
DeepSeek-R1 [20] | DeepSeek | 671B-A37B | project | paper | Yes
[1] Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. Paper
[2] OpenAI. GPT-4 Technical Report. arXiv. Paper
[3] Team, Gemini and Anil, Rohan and Borgeaud, Sebastian and Alayrac, Jean-Baptiste and Yu, Jiahui and Soricut, Radu and Schalkwyk, Johan and Dai, Andrew M and Hauth, Anja and Millican, Katie and others. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805. Paper
[4] Cai, Zheng and Cao, Maosong and Chen, Haojiong and Chen, Kai and Chen, Keyu and Chen, Xin and Chen, Xun and Chen, Zehui and Chen, Zhi and Chu, Pei and others. Internlm2 technical report. arXiv preprint arXiv:2403.17297. Paper
[5] Anthropic. Claude 3 Family. Project
[6] Team, Reka and Ormazabal, Aitor and Zheng, Che and d'Autume, Cyprien de Masson and Yogatama, Dani and Fu, Deyu and Ong, Donovan and Chen, Eric and Lamprecht, Eugenie and Pham, Hai and others. Reka core, flash, and edge: A series of powerful multimodal language models. arXiv preprint arXiv:2404.12387. Paper
[7] HuggingFaceH4. Zephyr-ORPO-141b-A35b-v0.1. Project
[8] Abdin, Marah and Aneja, Jyoti and Awadalla, Hany and Awadallah, Ahmed and Awan, Ammar Ahmad and Bach, Nguyen and Bahree, Amit and Bakhtiari, Arash and Bao, Jianmin and Behl, Harkirat and others. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219. Paper
[9] Liu, Aixin and Feng, Bei and Wang, Bin and Wang, Bingxuan and Liu, Bo and Zhao, Chenggang and Dengr, Chengqi and Ruan, Chong and Dai, Damai and Guo, Daya and others. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434. Paper
[10] GLM, Team and Zeng, Aohan and Xu, Bin and Wang, Bowen and Zhang, Chenhui and Yin, Da and Rojas, Diego and Feng, Guanyu and Zhao, Hanlin and Lai, Hanyu and others. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793. Paper
[11] Adler, Bo and Agarwal, Niket and Aithal, Ashwath and Anh, Dong H and Bhattacharya, Pallab and Brundyn, Annika and Casper, Jared and Catanzaro, Bryan and Clay, Sharon and Cohen, Jonathan and others. Nemotron-4 340B Technical Report. arXiv preprint arXiv:2406.11704. Paper
[12] Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Yang, Amy and Fan, Angela and others. The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Paper
[13] Yang, An and Yang, Baosong and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Zhou, Chang and Li, Chengpeng and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and others. Qwen2 technical report. arXiv preprint arXiv:2407.10671. Paper
[14] Team, Gemma and Riviere, Morgane and Pathak, Shreya and Sessa, Pier Giuseppe and Hardin, Cassidy and Bhupatiraju, Surya and Hussenot, Léonard and Mesnard, Thomas and Shahriari, Bobak and Ramé, Alexandre and others. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118. Paper
[15] Zhu, Banghua and Frick, Evan and Wu, Tianhao and Zhu, Hanlin and Ganesan, Karthik and Chiang, Wei-Lin and Zhang, Jian and Jiao, Jiantao. Starling-7b: Improving helpfulness and harmlessness with RLAIF. First Conference on Language Modeling. Paper
[16] Nexusflow. Athene-Llama3-70B: Advancing Open-Weight Chat Models. Paper
[17] Teknium, Ryan and Quesnelle, Jeffrey and Guang, Chen. Hermes 3 technical report. arXiv preprint arXiv:2408.11857. Paper
[18] OpenAI. Learning to Reason with LLMs. Project
[19] Kimi Team et al., 2025. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599. Paper
[20] DeepSeek-AI et al., 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. Paper
Reinforcement learning from human feedback (RLHF) is a training approach that combines reinforcement learning (RL) with human feedback to align LLMs with human values, preferences, and expectations. RLHF consists of two main components: (1) Collecting Human Feedback to Train Reward Model, where human evaluators provide feedback on the LLM's outputs by scoring or ranking responses based on factors such as quality and relevance. This feedback is then used to train a reward model that predicts the quality of the outputs and serves as the reward function in the RL process; and (2) Preference Optimization Using Human Feedback, where the trained reward model guides the optimization of the LLM's outputs to maximize predicted rewards, aligning the LLM's behavior with human preferences.
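To make component (1) concrete, reward models are commonly trained on pairwise comparisons with a Bradley-Terry style objective. The following is a minimal PyTorch sketch of that loss; `reward_model` and `batch` are hypothetical placeholders, not code from any of the surveyed papers:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style objective: push the scalar reward of the
    human-preferred (chosen) response above that of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage (names are placeholders):
# r_chosen   = reward_model(batch["prompt"], batch["chosen"])    # shape [B]
# r_rejected = reward_model(batch["prompt"], batch["rejected"])  # shape [B]
# loss = pairwise_reward_loss(r_chosen, r_rejected)
# loss.backward()
```

In component (2), the trained reward model typically scores responses sampled during RL fine-tuning (e.g., with PPO), usually together with a KL penalty that keeps the policy close to the reference model.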
[1] Liu, Chris Yuhao and Zeng, Liang and Liu, Jiacai and Yan, Rui and He, Jujie and Wang, Chaojie and Yan, Shuicheng and Liu, Yang and Zhou, Yahui. Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs. arXiv preprint arXiv:2410.18451. Paper
[2] Ivison, Hamish and Wang, Yizhong and Pyatkin, Valentina and Lambert, Nathan and Peters, Matthew and Dasigi, Pradeep and Jang, Joel and Wadden, David and Smith, Noah A and Beltagy, Iz and others. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702. Paper
[1] Yuan, Zheng and Yuan, Hongyi and Tan, Chuanqi and Wang, Wei and Huang, Songfang and Huang, Fei. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302. Paper
[2] Ahmadian, Arash and Cremer, Chris and Gallé, Matthias and Fadaee, Marzieh and Kreutzer, Julia and Pietquin, Olivier and Üstün, Ahmet and Hooker, Sara. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740. Paper
[3] Song, Feifan and Yu, Bowen and Li, Minghao and Yu, Haiyang and Huang, Fei and Li, Yongbin and Wang, Houfeng. Preference ranking optimization for human alignment. Proceedings of the AAAI Conference on Artificial Intelligence. Paper
[4] Swamy, Gokul and Dann, Christoph and Kidambi, Rahul and Wu, Zhiwei Steven and Agarwal, Alekh. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056. Paper
Reinforcement learning from AI feedback (RLAIF) serves as a promising alternative or supplement to RLHF, leveraging AI systems, often more powerful or specialized LLMs (e.g., GPT-4), to provide feedback on the outputs of the LLM being trained. This approach offers benefits such as scalability, consistency, and cost efficiency while minimizing reliance on human evaluators. Below, we explore several methods for substituting human feedback with AI feedback in reinforcement learning, highlighting three approaches: (1) Distilling AI Feedback to Train Reward Model, (2) Prompting LLMs As a Reward Function, and (3) Self-Rewarding.
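As a rough illustration of how AI feedback can replace a human rater, the sketch below prompts a judge LLM to score a response on a 1-10 scale and maps the parsed score to a scalar reward; such scores can be distilled into a reward model (approach 1) or used directly as the reward signal (approach 2). The `query_judge_llm` callable and the prompt template are hypothetical stand-ins, not an API from the cited works:

```python
import re
from typing import Callable

# Hypothetical judge prompt; real systems use more detailed rubrics.
JUDGE_PROMPT = (
    "You are a strict evaluator. Rate the ASSISTANT response to the USER "
    "prompt on a scale from 1 (poor) to 10 (excellent). Reply with the "
    "number only.\n\nUSER: {prompt}\n\nASSISTANT: {response}\n\nScore:"
)

def llm_reward(prompt: str, response: str,
               query_judge_llm: Callable[[str], str]) -> float:
    """Use a judge LLM as the reward source: ask for a 1-10 score,
    parse it, and normalize it to [0, 1]."""
    judge_input = JUDGE_PROMPT.format(prompt=prompt, response=response)
    judge_output = query_judge_llm(judge_input)      # e.g., an API call
    match = re.search(r"\d+(\.\d+)?", judge_output)
    score = float(match.group()) if match else 1.0   # fall back to the lowest score
    return (min(max(score, 1.0), 10.0) - 1.0) / 9.0  # clamp and normalize
```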
[1] Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377. Paper
[2] Xu, Zhangchen and Jiang, Fengqing and Niu, Luyao and Deng, Yuntian and Poovendran, Radha and Choi, Yejin and Lin, Bill Yuchen. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. arXiv preprint arXiv:2406.08464. Paper
[3] Wang, Zhilin and Dong, Yi and Delalleau, Olivier and Zeng, Jiaqi and Shen, Gerald and Egert, Daniel and Zhang, Jimmy J and Sreedhar, Makesh Narsimhan and Kuchaiev, Oleksii. HelpSteer2: Open-source dataset for training top-performing reward models. arXiv preprint arXiv:2406.08673. Paper
[4] Park, Junsoo and Jwa, Seungyeon and Ren, Meiying and Kim, Daeyoung and Choi, Sanghyuk. Offsetbias: Leveraging debiased data for tuning evaluators. arXiv preprint arXiv:2407.06551. Paper
[1] Du, Yuqing and Watkins, Olivia and Wang, Zihan and Colas, Cédric and Darrell, Trevor and Abbeel, Pieter and Gupta, Abhishek and Andreas, Jacob. Guiding pretraining in reinforcement learning with large language models. International Conference on Machine Learning. Paper
[2] Kwon, Minae and Xie, Sang Michael and Bullard, Kalesha and Sadigh, Dorsa. Reward design with language models. arXiv preprint arXiv:2303.00001. Paper
[3] Ma, Yecheng Jason and Liang, William and Wang, Guanzhi and Huang, De-An and Bastani, Osbert and Jayaraman, Dinesh and Zhu, Yuke and Fan, Linxi and Anandkumar, Anima. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931. Paper
[4] Xie, Tianbao and Zhao, Siheng and Wu, Chen Henry and Liu, Yitao and Luo, Qian and Zhong, Victor and Yang, Yanchao and Yu, Tao. Text2reward: Automated dense reward function generation for reinforcement learning. arXiv preprint arXiv:2309.11489. Paper
[5] Lee, Harrison and Phatale, Samrat and Mansoor, Hassan and Lu, Kellie and Mesnard, Thomas and Bishop, Colton and Carbune, Victor and Rastogi, Abhinav. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267. Paper
[6] Zhang, Lunjun and Hosseini, Arian and Bansal, Hritik and Kazemi, Mehran and Kumar, Aviral and Agarwal, Rishabh. Generative verifiers: Reward modeling as next-token prediction. arXiv preprint arXiv:2408.15240. Paper
[1] Song, Jiayang and Zhou, Zhehua and Liu, Jiawei and Fang, Chunrong and Shu, Zhan and Ma, Lei. Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics. arXiv preprint arXiv:2309.06687. Paper
[2] Yuan, Weizhe and Pang, Richard Yuanzhe and Cho, Kyunghyun and Sukhbaatar, Sainbayar and Xu, Jing and Weston, Jason. Self-rewarding language models. arXiv preprint arXiv:2401.10020. Paper
[3] Ye, Ziyi and Li, Xiangsheng and Li, Qiuchi and Ai, Qingyao and Zhou, Yujia and Shen, Wei and Yan, Dong and Liu, Yiqun. Beyond Scalar Reward Model: Learning Generative Judge from Preference Data. arXiv preprint arXiv:2410.03742. Paper
While RLHF and RLAIF are effective methods for aligning LLMs with desired behaviors, there are still challenges that require careful analysis. These include addressing out-of-distribution issues between the trained reward models and the aligned LLMs, ensuring the interpretability of the model for humans, and maintaining safety and evaluation benchmarks to train robust reward models. In this section, we discuss recent works that tackle these challenges and provide strategies for overcoming them.
[1] Lou, Xingzhou and Yan, Dong and Shen, Wei and Yan, Yuzi and Xie, Jian and Zhang, Junge. Uncertainty-aware reward model: Teaching reward models to know what is unknown. arXiv preprint arXiv:2410.00847. Paper
[2] Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong. Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs. arXiv preprint arXiv:2406.10216. Paper
[1] Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong. Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. arXiv preprint arXiv:2406.12845. Paper
[2] Dorka, Nicolai. Quantile Regression for Distributional Reward Models in RLHF. arXiv preprint arXiv:2409.10164. Paper
[3] Zhang, Yifan and Zhang, Ge and Wu, Yue and Xu, Kangping and Gu, Quanquan. General Preference Modeling with Preference Representations for Aligning Language Models. arXiv preprint arXiv:2410.02197. Paper
[1] Dai, Josef and Pan, Xuehai and Sun, Ruiyang and Ji, Jiaming and Xu, Xinbo and Liu, Mickel and Wang, Yizhou and Yang, Yaodong. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773. Paper
[2] Lu, Ximing and Welleck, Sean and Hessel, Jack and Jiang, Liwei and Qin, Lianhui and West, Peter and Ammanabrolu, Prithviraj and Choi, Yejin. Quark: Controllable text generation with reinforced unlearning. Advances in Neural Information Processing Systems. Paper
[3] Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073. Paper
[4] Ji, Jiaming and Liu, Mickel and Dai, Josef and Pan, Xuehai and Zhang, Chi and Bian, Ce and Chen, Boyuan and Sun, Ruiyang and Wang, Yizhou and Yang, Yaodong. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems. Paper
[5] Mu, Tong and Helyar, Alec and Heidecke, Johannes and Achiam, Joshua and Vallone, Andrea and Kivlichan, Ian and Lin, Molly and Beutel, Alex and Schulman, John and Weng, Lilian. Rule based rewards for language model safety. arXiv preprint arXiv:2411.01111. Paper
[1] Lambert, Nathan and Pyatkin, Valentina and Morrison, Jacob and Miranda, LJ and Lin, Bill Yuchen and Chandu, Khyathi and Dziri, Nouha and Kumar, Sachin and Zick, Tom and Choi, Yejin and others. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787. Paper
[2] Kim, Seungone and Suk, Juyoung and Longpre, Shayne and Lin, Bill Yuchen and Shin, Jamin and Welleck, Sean and Neubig, Graham and Lee, Moontae and Lee, Kyungjae and Seo, Minjoon. Prometheus 2: An open source language model specialized in evaluating other language models. arXiv preprint arXiv:2405.01535. Paper
While effective, RLHF and RLAIF are often mired in complexity due to the challenges of reinforcement learning algorithms and the necessity of an accurately trained reward model. Recent research has turned towards Direct Preference Optimization (DPO), which bypasses the reward model by directly using human preference data to fine-tune LLMs. DPO reframes the objective from reward maximization to preference optimization, offering a straightforward and potentially more robust pathway for aligning LLM outputs with human expectations. This section delves into the methodologies underpinning DPO, exploring how approaches such as SLiC-HF and the related methods listed below optimize LLMs directly on preference data.
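For reference, the standard DPO objective can be written in a few lines. The sketch below assumes per-example summed log-probabilities of the chosen and rejected responses under the current policy and a frozen reference model; function and argument names are chosen for illustration:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: increase the policy's log-probability margin
    between the preferred and dispreferred responses, measured relative to
    the reference model and scaled by beta."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

Here beta controls how strongly the policy may deviate from the reference model; many of the DPO variants surveyed below modify this objective, for example the loss shape, the sampling scheme, or the granularity at which preferences are applied.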
[1] Zhao, Yao and Joshi, Rishabh and Liu, Tianqi and Khalman, Misha and Saleh, Mohammad and Liu, Peter J. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425. Paper
[2] Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems. Paper
[3] Wu, Junkang and Xie, Yuexiang and Yang, Zhengyi and Wu, Jiancan and Gao, Jinyang and Ding, Bolin and Wang, Xiang and He, Xiangnan. β-DPO: Direct Preference Optimization with Dynamic β. arXiv preprint arXiv:2407.08639. Paper
[4] Kim, Dahyun and Kim, Yungi and Song, Wonho and Kim, Hyeonwoo and Kim, Yunsu and Kim, Sanghoon and Park, Chanjun. sDPO: Don't Use Your Data All at Once. arXiv preprint arXiv:2403.19270. Paper
[5] Liu, Tianqi and Zhao, Yao and Joshi, Rishabh and Khalman, Misha and Saleh, Mohammad and Liu, Peter J. and Liu, Jialu. Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657. Paper
[6] Tang, Yunhao and Guo, Zhaohan Daniel and Zheng, Zeyu and Calandriello, Daniele and Munos, Rémi and Rowland, Mark and Richemond, Pierre Harvey and Valko, Michal and Pires, Bernardo Ávila and Piot, Bilal. Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749. Paper
[7] Richemond, Pierre Harvey and Tang, Yunhao and Guo, Daniel and Calandriello, Daniele and Azar, Mohammad Gheshlaghi and Rafailov, Rafael and Pires, Bernardo Avila and others. Offline Regularised Reinforcement Learning for Large Language Models Alignment. arXiv preprint arXiv:2405.19107. Paper
While the simplicity and efficiency of DPO make it an appealing choice, its practical implementation reveals challenges and opportunities for improvement. This section delves into the safety implications of DPO, particularly in how it handles harmful outputs, and explores DPO variants that aim to optimize the trade-off between minimizing harmful content and maintaining generative diversity. We review studies that highlight the theoretical and practical considerations defining the effectiveness and limitations of DPO-based methods in achieving safe, reliable, and interpretable LLMs.
[1] Duan, Shitong and Yi, Xiaoyuan and Zhang, Peng and Lu, Tun and Xie, Xing and Gu, Ning. Negating negatives: Alignment without human positive samples via distributional dispreference optimization. arXiv preprint arXiv:2403.03419. Paper
[2] Zhang, Ruiqi and Lin, Licong and Bai, Yu and Mei, Song. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868. Paper
[1] Rosset, Corby and Cheng, Ching-An and Mitra, Arindam and Santacroce, Michael and Awadallah, Ahmed and Xie, Tengyang. Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715. Paper
[2] Wu, Yue and Sun, Zhiqing and Yuan, Huizhuo and Ji, Kaixuan and Yang, Yiming and Gu, Quanquan. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2404.00675. Paper
[3] Swamy, Gokul and Dann, Christoph and Kidambi, Rahul and Wu, Zhiwei Steven and Agarwal, Alekh. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056. Paper
[4] Pal, Arka and Karkhanis, Deep and Dooley, Samuel and Roberts, Manley and Naidu, Siddartha and White, Colin. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228. Paper
[5] Zeng, Yongcheng and Liu, Guoqing and Ma, Weiyu and Yang, Ning and Zhang, Haifeng and Wang, Jun. Token-level Direct Preference Optimization. arXiv preprint arXiv:2404.11999. Paper
[1] Azar, Mohammad Gheshlaghi and Guo, Zhaohan Daniel and Piot, Bilal and Munos, Remi and Rowland, Mark and Valko, Michal and Calandriello, Daniele. A general theoretical paradigm to understand learning from human preferences. International Conference on Artificial Intelligence and Statistics. Paper
[2] Ivison, Hamish and Wang, Yizhong and Liu, Jiacheng and Wu, Zeqiu and Pyatkin, Valentina and Lambert, Nathan and Smith, Noah A. and Choi, Yejin and Hajishirzi, Hannaneh. Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. arXiv preprint arXiv:2406.09279. Paper
[3] Xiong, Wei and Dong, Hanze and Ye, Chenlu and Wang, Ziqi and Zhong, Han and Ji, Heng and Jiang, Nan and Zhang, Tong. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. Forty-first International Conference on Machine Learning. Paper
[4] Saeidi, Amir and Verma, Shivanshu and Baral, Chitta. Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks, 2024. Paper
[5] Xu, Shusheng and Fu, Wei and Gao, Jiaxuan and Ye, Wenjie and Liu, Weilin and Mei, Zhiyu and Wang, Guangju and Yu, Chao and Wu, Yi. Is dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719. Paper
If you have any questions or suggestions, please feel free to create an issue or send an e-mail to shuhewang@student.unimelb.edu.au.