RLHF is a two-stage process designed to fine-tune language models to align with human preferences. It involves:
1.1 Collect preference data
- Human annotators compare pairs (or triples) of completions and determine a preference ranking based on criteria such as usefulness, clarity, and tone. In this project, we use the IMDB dataset: https://huggingface.co/datasets/stanfordnlp/imdb.
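For reference, the dataset can be loaded with the Hugging Face `datasets` library (a minimal sketch; how prompts and preference pairs are derived from the raw reviews is project-specific and not shown here):

```python
from datasets import load_dataset

# IMDB movie reviews; each example has a "text" field and a binary sentiment "label".
imdb = load_dataset("stanfordnlp/imdb", split="train")
print(imdb[0]["text"][:200], imdb[0]["label"])
```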
1.2 Train the reward model
- A pretrained encoder or LLM is augmented with a scalar regression head.
- For each pair (prompt, y_winner, y_loser), define a cross-entropy loss on which completion is preferred (a sketch is given below):
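A minimal sketch of this pairwise loss in PyTorch (function and argument names are illustrative, not the repo's actual API; `reward_model` is assumed to return one scalar score per sequence):

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style cross-entropy loss on a batch of preference pairs."""
    r_chosen = reward_model(chosen_ids)      # (batch,) scores for y_winner
    r_rejected = reward_model(rejected_ids)  # (batch,) scores for y_loser
    # -log sigmoid(r_w - r_l): pushes the winner's score above the loser's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```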
Once the reward model is trained:
2.1 Generate rollouts
- Sample prompts from a dataset, generate completions using the current policy model, and score each with the reward model.
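A sketch of the rollout step under stated assumptions (a Hugging Face causal LM policy with `.generate()`, and a reward model that scores whole token sequences; names are illustrative):

```python
import torch

@torch.no_grad()
def generate_rollouts(policy, tokenizer, reward_model, prompts, max_new_tokens=48):
    """Sample completions from the current policy and score them with the reward model."""
    enc = tokenizer(prompts, return_tensors="pt", padding=True)
    sequences = policy.generate(
        **enc,
        do_sample=True,                      # stochastic sampling, not greedy decoding
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
    )
    # If the reward model uses a different tokenizer (see 2.5), decode and
    # re-tokenize the sequences before scoring them.
    rewards = reward_model(sequences)        # (batch,) one scalar reward per completion
    return sequences, rewards
```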
2.2 Add KL penalty
- To prevent the policy model from drifting too far from the reference model, include a per-token KL-divergence penalty in the reward:
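One common form of this penalty (as in InstructGPT): the per-token reward is the negative KL estimate scaled by a coefficient, with the reward model's sequence-level score added on the final token. A sketch, where `kl_coef` is an illustrative value rather than the repo's setting:

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, reward_score, kl_coef=0.2):
    """Per-token rewards combining a KL penalty against the reference (SFT) model
    with the sequence-level reward model score.

    policy_logprobs, ref_logprobs: (batch, T) log-probs of the sampled tokens.
    reward_score: (batch,) scalar score from the reward model.
    """
    kl = policy_logprobs - ref_logprobs   # per-token KL estimate
    rewards = -kl_coef * kl               # penalize drift from the reference model
    rewards[:, -1] += reward_score        # add the reward model score at the last token
    return rewards
```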
2.3 Compute advantages and apply PPO
- Use a critic (value function network) V(x) to estimate values and compute advantages (see the sketch below).
- Optimize the policy using the clipped surrogate objective.
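A sketch of both steps under stated assumptions (per-token rewards and value estimates, token-level log-probabilities; γ, λ, and the clip range are illustrative defaults, not the repo's settings):

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over (batch, T) reward and value tensors."""
    T = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(rewards[:, 0])
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros_like(values[:, t])
        delta = rewards[:, t] + gamma * next_value - values[:, t]   # TD residual
        last_gae = delta + gamma * lam * last_gae
        advantages[:, t] = last_gae
    return advantages

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```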
2.4 Update critic (value function)
- Minimize mean-squared TD-error.
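A minimal sketch of the critic update, assuming the critic is regressed onto GAE-based returns (advantages plus the old value estimates, a TD(λ)-style target); names are illustrative:

```python
import torch.nn.functional as F

def value_loss(predicted_values, returns):
    """Mean-squared error between the critic's predictions and the target returns."""
    return F.mse_loss(predicted_values, returns)
```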
2.5 Pseudocode and Implementation Details
- The value network is implemented as an additional head on top of the LLM backbone (see the sketch after this list).
- Handles different tokenizers for the reward model and the policy/value model.
- Uses a memory-efficient AdamW optimizer that keeps its optimizer states on the CPU.
- Training is done on a single Nvidia 3060 GPU.
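A minimal sketch of such a value head, assuming the backbone exposes per-token hidden states (the actual module in this repo may differ):

```python
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head attached on top of the LLM backbone's hidden states."""

    def __init__(self, hidden_size):
        super().__init__()
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the backbone
        return self.value(hidden_states).squeeze(-1)  # (batch, seq_len) per-token values
```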
| Stage | Input | Output |
|---|---|---|
| Reward | Preference pairs (prompt, winner, loser) | Trained reward model R_ϕ |
| Policy | Prompts, R_ϕ, and reference policy π_SFT | Aligned policy π_RL via PPO |
- Reward model stays fixed during policy training.
- Policy model is updated using PPO and a KL penalty to balance alignment and distributional faithfulness.
- Critic stabilizes training with advantage estimation.
4.1 Update training parameters in `src/config/ppo.yaml`
4.2 Train the reward model via `python run_reward_model_trainer.py`
4.3 Train the policy network via `python run_ppo_trainer.py`
5.1 Reward Model Training Tensorboard
Reward model loss
5.2 Policy and Value Models (PPO) Training Tensorboard
PPO loss
Value network TD error
Average reward of generated sequences during training
Average reward of generated sequences during evaluation
PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. https://arxiv.org/abs/1707.06347
InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744. https://arxiv.org/abs/2203.02155
Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. https://arxiv.org/abs/1506.02438
https://spinningup.openai.com/en/latest/algorithms/ppo.html
https://arxiv.org/pdf/2403.17031
https://arxiv.org/pdf/2203.02155
https://github.com/ash80/RLHF_in_notebooks