RLHF is a two-stage process designed to fine-tune language models to align with human preferences. It involves:
1.1 Collect preference data
- Human annotators compare pairs (or triples) of completions and determine a preference ranking based on criteria such as usefulness, clarity, and tone. In this project, we use the IMDB dataset: https://huggingface.co/datasets/stanfordnlp/imdb.
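For reference, the dataset can be loaded with the Hugging Face `datasets` library (a minimal sketch; how prompts and preference pairs are derived from the raw reviews is project-specific and not shown here):

```python
from datasets import load_dataset

# IMDB movie reviews; each example has a "text" field and a binary sentiment "label".
imdb = load_dataset("stanfordnlp/imdb", split="train")
print(imdb[0]["text"][:200], imdb[0]["label"])
```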
1.2 Train the reward model
- A pretrained encoder or LLM is augmented with a scalar regression head.
- For each pair (prompt, y_winner, y_loser), define a cross-entropy loss on which completion is preferred (a sketch is given below):
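A minimal sketch of this pairwise loss in PyTorch (function and argument names are illustrative, not the repo's actual API; `reward_model` is assumed to return one scalar score per sequence):

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style cross-entropy loss on a batch of preference pairs."""
    r_chosen = reward_model(chosen_ids)      # (batch,) scores for y_winner
    r_rejected = reward_model(rejected_ids)  # (batch,) scores for y_loser
    # -log sigmoid(r_w - r_l): pushes the winner's score above the loser's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```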
Once the reward model is trained:
2.1 Generate rollouts
- Sample prompts from a dataset, generate completions using the current policy model, and score each with the reward model.
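A sketch of the rollout step under stated assumptions (a Hugging Face causal LM policy with `.generate()`, and a reward model that scores whole token sequences; names are illustrative):

```python
import torch

@torch.no_grad()
def generate_rollouts(policy, tokenizer, reward_model, prompts, max_new_tokens=48):
    """Sample completions from the current policy and score them with the reward model."""
    enc = tokenizer(prompts, return_tensors="pt", padding=True)
    sequences = policy.generate(
        **enc,
        do_sample=True,                      # stochastic sampling, not greedy decoding
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
    )
    # If the reward model uses a different tokenizer (see 2.5), decode and
    # re-tokenize the sequences before scoring them.
    rewards = reward_model(sequences)        # (batch,) one scalar reward per completion
    return sequences, rewards
```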
2.2 Add KL penalty
- To prevent the policy model from drifting too far from the reference model, include a per-token KL-divergence penalty in the reward:
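One common form of this penalty (as in InstructGPT): the per-token reward is the negative KL estimate scaled by a coefficient, with the reward model's sequence-level score added on the final token. A sketch, where `kl_coef` is an illustrative value rather than the repo's setting:

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, reward_score, kl_coef=0.2):
    """Per-token rewards combining a KL penalty against the reference (SFT) model
    with the sequence-level reward model score.

    policy_logprobs, ref_logprobs: (batch, T) log-probs of the sampled tokens.
    reward_score: (batch,) scalar score from the reward model.
    """
    kl = policy_logprobs - ref_logprobs   # per-token KL estimate
    rewards = -kl_coef * kl               # penalize drift from the reference model
    rewards[:, -1] += reward_score        # add the reward model score at the last token
    return rewards
```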
2.3 Compute advantages and apply PPO
- Use a critic (value function network) V(x) to estimate values and compute advantages (see the sketch below).
- Optimize the policy using the clipped surrogate objective.
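A sketch of both steps under stated assumptions (per-token rewards and value estimates, token-level log-probabilities; γ, λ, and the clip range are illustrative defaults, not the repo's settings):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over (batch, T) reward and value tensors."""
    T = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(rewards[:, 0])
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * next_value - values[:, t]   # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae
    return advantages

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```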
2.4 Update critic (value function)
- Minimize mean-squared TD-error.
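A minimal sketch of the critic update, assuming the critic is regressed onto GAE-based returns (advantages plus the old value estimates, a TD(λ)-style target); names are illustrative:

```python
import torch.nn.functional as F

def value_loss(predicted_values, returns):
    """Mean-squared error between the critic's predictions and the target returns."""
    return F.mse_loss(predicted_values, returns)
```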
2.5 Pseudocode and Implementation Details
- The value network is implemented as an additional head on top of the LLM backbone (see the sketch after this list).
- Handles different tokenizers for the reward model and the policy/value model.
- Uses a memory-efficient AdamW optimizer that keeps its optimizer states on the CPU.
- Training is done on a single Nvidia 3060 GPU.
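A minimal sketch of such a value head, assuming the backbone exposes per-token hidden states (the actual module in this repo may differ):

```python
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head attached on top of the LLM backbone's hidden states."""

    def __init__(self, hidden_size):
        super().__init__()
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the backbone
        return self.value(hidden_states).squeeze(-1)  # (batch, seq_len) per-token values
```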
| Stage | Input | Output |
|---|---|---|
| Reward | Preference pairs (prompt, winner, loser) | Trained reward model R_ϕ |
| Policy | Prompts, R_ϕ, and reference policy π_SFT | Aligned policy π_RL via PPO |
- Reward model stays fixed during policy training.
- Policy model is updated using PPO and a KL penalty to balance alignment and distributional faithfulness.
- Critic stabilizes training with advantage estimation.
4.1 Update training parameters in `src/config/ppo.yaml`
4.2 Train the reward model via `python run_reward_model_trainer.py`
4.3 Train the policy network via `python run_ppo_trainer.py`
5.1 Reward Model Training Tensorboard
Reward model loss
5.2 Policy and Value Models (PPO) Training Tensorboard
PPO loss
Value network TD error
Average reward of generated sequences during training
Average reward of generated sequences during evaluation
PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. https://arxiv.org/abs/1707.06347
InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744. https://arxiv.org/abs/2203.02155
Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. https://arxiv.org/abs/1506.02438
https://spinningup.openai.com/en/latest/algorithms/ppo.html
https://arxiv.org/pdf/2403.17031
https://arxiv.org/pdf/2203.02155
https://github.com/ash80/RLHF_in_notebooks