RLHF + PPO: Aligning LLMs with Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a two-stage process for fine-tuning language models to align with human preferences. It involves:

1. Reward Model Training

1.1 Collect preference data

1.2 Train the reward model

  • A pretrained encoder or LLM is augmented with a scalar regression head.
  • For each preference pair (prompt, y_winner, y_loser), minimize the pairwise cross-entropy (Bradley–Terry) loss:

L(ϕ) = − E[ log σ( R_ϕ(x, y_winner) − R_ϕ(x, y_loser) ) ], taken over preference pairs (x, y_winner, y_loser).
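A minimal PyTorch sketch of this loss, assuming a reward model that returns one scalar score per (prompt, completion) sequence (the function and argument names are illustrative, not the repository's actual interface):

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, winner_ids, winner_mask, loser_ids, loser_mask):
    """-log sigmoid(R(x, y_winner) - R(x, y_loser)), averaged over the batch."""
    r_winner = reward_model(winner_ids, attention_mask=winner_mask)  # (batch,)
    r_loser = reward_model(loser_ids, attention_mask=loser_mask)     # (batch,)
    return -F.logsigmoid(r_winner - r_loser).mean()
```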

2. Policy Model Training via PPO

Once the reward model is trained:

2.1 Generate rollouts

  • Sample prompts from a dataset, generate completions using the current policy model, and score each with the reward model.

2.2 Add KL penalty

  • To prevent the policy model from drifting too far from the reference model, include a per-token KL-divergence penalty in the reward:

r_t = − β · ( log π_RL(y_t | x, y_<t) − log π_SFT(y_t | x, y_<t) ), with the reward-model score R_ϕ(x, y) added at the final generated token.
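A minimal sketch of this KL-penalized reward, assuming right-padded rollouts of equal length and one scalar reward-model score per sequence (the names and the default β are illustrative):

```python
import torch

def kl_penalized_rewards(policy_logprobs, ref_logprobs, rm_scores, beta=0.1):
    """Per-token reward: -beta * (log pi_RL - log pi_SFT) for the sampled tokens,
    with the reward-model score added at the last generated token.

    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the sampled tokens
    rm_scores: (batch,) scalar scores from the frozen reward model
    """
    rewards = -beta * (policy_logprobs - ref_logprobs)
    rewards[:, -1] += rm_scores  # sequence-level reward applied at the final token
    return rewards
```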

2.3 Compute advantages and apply PPO

  • Use a critic (value-function network) V to estimate per-token values, and compute advantages (e.g., with Generalized Advantage Estimation).
  • Optimize the policy with the clipped surrogate objective:

L^CLIP(θ) = E_t [ min( r_t(θ) Â_t, clip( r_t(θ), 1 − ε, 1 + ε ) Â_t ) ], where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the probability ratio and Â_t the estimated advantage.
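A minimal sketch of advantage estimation (GAE) and the clipped surrogate loss, assuming fixed-length rollouts; the hyperparameter defaults (γ, λ, ε) are illustrative, not the repository's settings:

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over the generated tokens.
    rewards, values: (batch, seq_len); values come from the critic head."""
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(rewards.size(0), device=rewards.device)
    for t in reversed(range(rewards.size(1))):
        next_value = values[:, t + 1] if t + 1 < rewards.size(1) else 0.0
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae
    return advantages

def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, negated so it can be minimized."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```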

2.4 Update critic (value function)

  • Minimize the mean-squared TD error between the critic's value predictions and the estimated returns.
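A minimal sketch of the critic update, where the regression targets are the returns (GAE advantages plus the value estimates recorded at rollout time); the names are illustrative:

```python
import torch.nn.functional as F

def value_loss(values, advantages, old_values):
    """Mean-squared error between new value predictions and the estimated returns."""
    returns = advantages + old_values  # targets built from the rollout statistics
    return F.mse_loss(values, returns)
```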

2.5 Pseudocode and Implementation Details

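In outline, one PPO iteration combines the steps above:

  1. Sample a batch of prompts and generate completions with the current policy π_RL.
  2. Score each (prompt, completion) with the frozen reward model R_ϕ.
  3. Compute per-token log-probs under π_RL and the reference policy π_SFT, and form the KL-penalized per-token rewards.
  4. Run the critic head to get per-token values, then compute advantages and returns.
  5. For a few epochs over the rollout batch, update the policy with the clipped surrogate loss and the critic with the mean-squared value loss.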

  • The value network is implemented as an additional scalar head on top of the LLM backbone (see the sketch after this list).

  • Handles different tokenizers for the reward model and the policy/value model.

  • Uses a memory-efficient AdamW optimizer that keeps optimizer states on the CPU.

  • Training runs on a single NVIDIA 3060 GPU.
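A minimal sketch of a value head on top of the LLM backbone, assuming a Hugging Face-style causal LM that can return hidden states (the class and attribute names are illustrative, not the repository's actual modules):

```python
import torch.nn as nn

class PolicyWithValueHead(nn.Module):
    """Causal LM backbone plus a scalar head that predicts a value per token."""
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                     # e.g. a causal LM
        self.value_head = nn.Linear(hidden_size, 1)  # scalar value per position

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids,
                            attention_mask=attention_mask,
                            output_hidden_states=True)
        hidden = out.hidden_states[-1]                # (batch, seq_len, hidden)
        values = self.value_head(hidden).squeeze(-1)  # (batch, seq_len)
        return out.logits, values                     # policy logits + per-token values
```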


3. Pipeline Overview

| Stage  | Input                                     | Output                       |
|--------|-------------------------------------------|------------------------------|
| Reward | Preference pairs (prompt, winner, loser)  | Trained reward model R_ϕ     |
| Policy | Prompts, R_ϕ, and reference policy π_SFT  | Aligned policy π_RL via PPO  |
  • Reward model stays fixed during policy training.
  • Policy model is updated using PPO with a KL penalty that balances reward maximization against drift from the reference distribution.
  • Critic stabilizes training with advantage estimation.

4. Reward Model and Policy Model Training

4.1 Update training parameters in

src/config/ppo.yaml

4.2 Train the reward model via

python run_reward_model_trainer.py

4.3 Train the policy network via

python run_ppo_trainer.py

5. Training Results

5.1 Reward Model Training (TensorBoard)

  • Reward model loss (rm_loss)

5.2 Policy and Value Models (PPO) Training (TensorBoard)

  • PPO loss (training_loss)
  • Value network TD error (td_error)
  • Average reward of generated sequences during training (avg_rewards)
  • Average reward of generated sequences during evaluation (eval_avg_rewards)

6. References

PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. https://arxiv.org/abs/1707.06347

InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744. https://arxiv.org/abs/2203.02155

Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. https://arxiv.org/abs/1506.02438

https://spinningup.openai.com/en/latest/algorithms/ppo.html

https://arxiv.org/pdf/2403.17031


https://github.com/ash80/RLHF_in_notebooks

https://www.youtube.com/watch?v=11M_kfuPJ5I

https://github.com/hkproj/rlhf-ppo

About

A PyTorch implementation of RLHF with PPO from scratch.
