CMU 11868 LLM Systems Spring 2025
In this assignment, you will learn to use VERL (Volcano Engine Reinforcement Learning), a flexible and efficient reinforcement learning framework designed for large language models. You'll implement a basic RLHF (Reinforcement Learning from Human Feedback) pipeline using VERL to fine-tune a small language model for harmless and helpful responses.
After completing this assignment, you will be able to:
- Understand the basic concepts of RLHF and its application to LLMs
- Set up and configure the VERL framework
- Implement a simple reward model for evaluating LLM outputs
- Use VERL's PPO implementation to fine-tune a language model
- Evaluate the performance improvements after RLHF training
Before starting the implementation, familiarize yourself with the core concepts:
Essential Reading:
- VERL Documentation - Framework overview and API reference
- HybridFlow Paper - The research foundation behind VERL
- RLHF Tutorial - Comprehensive introduction to RLHF methodology
- PPO Algorithm Explained - Understanding the RL algorithm
Key Concepts:
- RLHF Pipeline: Human preference data → Reward model training → Policy optimization with PPO
- VERL's Hybrid Architecture: Separation of generation (inference) and training phases for scalability
- PPO Algorithm: Policy gradient method with clipping for stable training
- Reward Model: Neural network trained to predict human preferences
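To make the PPO clipping from the key concepts above concrete, here is a minimal PyTorch sketch of the clipped surrogate objective. The tensor names and the default clip range are illustrative assumptions, not VERL internals.

```python
import torch

def ppo_clipped_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss (illustrative sketch; names and clip range are assumptions).

    log_probs:     log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t) from the rollout policy
    advantages:    advantage estimates A_t (e.g., from GAE)
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = torch.exp(log_probs - old_log_probs)
    # Unclipped and clipped surrogate terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes min(unclipped, clipped); return the negated mean as a loss
    return -torch.min(unclipped, clipped).mean()
```

Clipping the ratio keeps each policy update close to the rollout policy, which is what stabilizes training.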
git clone https://github.com/llmsystem/llmsys_f25_hw7.git
cd llmsys_f25_hw7
conda create -n llmsys_hw7 python=3.9
conda activate llmsys_hw7
# Install VERL and other dependencies
pip install -r requirements.txt
This assignment uses the Anthropic/hh-rlhf dataset from Hugging Face, which contains real human preference data for helpfulness and harmlessness. This is the same dataset used in many RLHF research papers.
The training data is automatically downloaded and prepared when you first run the training scripts. You can also prepare it manually:
# Download and prepare the dataset
python scripts/prepare_data.py --dataset Anthropic/hh-rlhf --output_dir data --max_samples 10000
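If you want to inspect the preference data yourself, it can be loaded directly with the Hugging Face datasets library; each example contains a preferred ("chosen") and a dispreferred ("rejected") conversation. This snippet is only for exploration; prepare_data.py handles the actual preprocessing.

```python
from datasets import load_dataset

# Load the human preference data used for reward model training
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# Each example pairs a preferred ("chosen") and a dispreferred ("rejected") conversation
example = dataset[0]
print(example["chosen"][:300])
print(example["rejected"][:300])
```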
1.1 Complete the loss implementation in src/reward_model.py:
In this problem, you will:
- Implement the ranking loss for reward model training in compute_loss (a reference sketch follows below).
- Note that the model is based on a pre-trained transformer (DistilBERT).
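A common choice for the ranking loss is the Bradley-Terry pairwise objective, -log σ(r_chosen - r_rejected), averaged over preference pairs. The sketch below is a minimal reference for what compute_loss should roughly compute; the argument names and shapes are assumptions and may differ from the starter code.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_rewards: torch.Tensor,
                          rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style ranking loss for reward model training (reference sketch).

    chosen_rewards / rejected_rewards: scalar reward predictions of shape (batch,)
    for the preferred and dispreferred response of each preference pair.
    """
    # -log sigmoid(r_chosen - r_rejected), using logsigmoid for numerical stability
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Minimizing this loss pushes the reward of the preferred response above that of the rejected one.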
1.2 Train your reward model using the provided preference data:
python scripts/train_reward_model.py
2.1 Complete the RLHF trainer implementation in src/rlhf_trainer.py:
- Implement the VERLTrainer class using VERL's PPO implementation (a framework-agnostic sketch of one RLHF iteration follows below).
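Independently of VERL's concrete API (follow the framework documentation and the starter code for that), one PPO-based RLHF iteration has the same skeleton: generate rollouts with the current policy, score them with the reward model, and run a clipped PPO update. The sketch below is framework-agnostic; the callables it takes are placeholders, not VERL functions.

```python
from typing import Any, Callable, Sequence

def rlhf_iteration(prompts: Sequence[str],
                   generate: Callable[[Sequence[str]], Sequence[str]],
                   score: Callable[[Sequence[str], Sequence[str]], Sequence[float]],
                   ppo_update: Callable[[Sequence[str], Sequence[str], Sequence[float]], Any]) -> Any:
    """One RLHF iteration, shown schematically (placeholders, not VERL's API).

    generate:   the current policy's sampling function (inference phase)
    score:      the reward model scoring (prompt, response) pairs
    ppo_update: the training step that consumes rollouts and rewards (clipped PPO)
    """
    # 1) Rollout: sample responses from the current policy
    responses = generate(prompts)
    # 2) Reward: score each (prompt, response) pair with the reward model
    rewards = score(prompts, responses)
    # 3) Update: run the PPO step on the collected experience
    return ppo_update(prompts, responses, rewards)
```

VERL's hybrid architecture separates the rollout and update phases onto different execution engines, which is what makes this loop scale.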
2.2 Run the RLHF training process:
python scripts/run_rlhf.py --model_name gpt2 --config src/config.py
3.1 Run comprehensive evaluation:
python scripts/evaluate.py --base_model gpt2 --rlhf_model outputs/rlhf_model --config src/config.py
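Besides the provided script, it is useful to qualitatively compare the two models by sampling responses to the same prompt from each. The helper below is an illustrative sketch using Hugging Face transformers; generate_response and its sampling parameters are not part of the starter code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_response(model_dir: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Sample one continuation from a model directory (e.g., 'gpt2' or 'outputs/rlhf_model')."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=True, top_p=0.9)
    # Strip the prompt tokens and decode only the newly generated continuation
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Example: compare base vs. RLHF-tuned outputs on the same prompt
prompt = "Human: How do I politely decline a meeting?\n\nAssistant:"
print(generate_response("gpt2", prompt))
print(generate_response("outputs/rlhf_model", prompt))
```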
Run the provided tests to verify your implementation:
python -m pytest tests/ -v
- Code Submission:
  - Ensure all code runs without errors
  - Include all required output files
  - Test your implementation with python -m pytest tests/
- Create Submission Archive:
# Remove large model files but keep small checkpoints
find outputs/ -name "*.bin" -size +100M -delete
# Create submission zip
zip -r assignment7_[your_andrew_id].zip . -x "*.git*" "*__pycache__*" "*.pyc"