Group Relative Policy Optimization (GRPO) is an algorithm proposed by DeepSeek for training large language models with reinforcement learning. This repository aggregates and refactors four distinct implementations of GRPO, each demonstrating different approaches to the core algorithm while sharing common principles.
The core GRPO algorithm follows these steps:
- For each training step, randomly sample $N$ questions $q_1, q_2, \cdots, q_N$.
- For each question $q_i$, sample $M$ answers $a_{i,1}, a_{i,2}, \cdots, a_{i,M}$.
- Compute the reward $r_{i,j}$ for each answer $a_{i,j}$.
- Compute group statistics for each question $q_i$: $$\mu_i = \frac{1}{M}\sum_{j=1}^{M} r_{i,j}, \qquad \sigma_i = \sqrt{\frac{1}{M}\sum_{j=1}^{M}\left(r_{i,j} - \mu_i\right)^2}$$
- For each token $t$ in answer $a_{i,j}$, compute the advantage: $$A_{i,j}[t] \leftarrow \frac{r_{i,j} - \mu_i}{\sigma_i}$$
- Update the policy using the PPO surrogate objective, whose per-token gradient (without clipping) is $$\nabla_\theta \log \pi_\theta(a_{i,j}[t]) \cdot A_{i,j}[t]$$
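To make these steps concrete, here is a minimal sketch of one GRPO step in plain PyTorch. The helpers `sample_answers`, `compute_reward`, and `token_log_probs` are hypothetical placeholders, not functions from any of the four implementations.

```python
import torch

def grpo_step(policy, optimizer, questions, M,
              sample_answers, compute_reward, token_log_probs, eps=1e-4):
    """One GRPO update over a batch of questions.

    `sample_answers(policy, q, M)` returns M sampled answers,
    `compute_reward(q, a)` returns a scalar reward, and
    `token_log_probs(policy, q, a)` returns per-token log-probs of answer a.
    All three are hypothetical helpers used only for this sketch.
    """
    losses = []
    for q in questions:
        answers = sample_answers(policy, q, M)              # group of M answers
        rewards = torch.tensor([compute_reward(q, a) for a in answers])
        mu, sigma = rewards.mean(), rewards.std()           # group statistics
        advantages = (rewards - mu) / (sigma + eps)         # normalized advantages
        for a, A in zip(answers, advantages):
            logp = token_log_probs(policy, q, a)            # shape: [num_tokens]
            losses.append(-(logp * A).mean())               # every token shares A
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```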
We provide four refactored implementations of GRPO, each with a different focus and design:
1. nanoAhaMoment

An implementation from nanoAhaMoment that separates each step of the GRPO loop into distinct components. It uses a rule-based reward function for a Countdown task and integrates with vLLM for efficient generation.
- Modular pipeline with separated components
- vLLM integration for efficient generation
- DeepSpeed training backend
- Format: `<think>...</think>\n<answer>...</answer>`
- Rule-based reward functions for Countdown tasks
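As an illustration of the format check, a rule-based format reward can be a simple regular expression over the `<think>`/`<answer>` tags; the pattern and reward value below are assumptions, not nanoAhaMoment's exact code.

```python
import re

FORMAT_PATTERN = re.compile(
    r"^<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the <think>/<answer> format, else 0.0.

    The pattern and reward value are illustrative; the repository's own
    rule-based reward for the Countdown task may use different weights.
    """
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0
```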
2. GRPO:Zero
An implementation from GRPO-Zero that follows a simplified training workflow. It trains Qwen2.5-3B-Instruct on the Countdown-Tasks-3to4 dataset and uses a combined reward for correctness and format.
- Qwen2.5-3B-Instruct base model
- Countdown-Tasks-3to4 dataset
- Simplified training workflow
- Reward Function: Combined reward for correctness and format
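For the Countdown task, a correctness reward typically checks that the proposed equation uses exactly the given numbers and evaluates to the target. The sketch below is an approximation under those assumptions, not GRPO-Zero's exact reward.

```python
import re

def countdown_correctness_reward(equation: str, numbers: list[int], target: int) -> float:
    """Illustrative correctness check for a Countdown answer.

    Returns 1.0 if the equation uses exactly the given numbers (each once)
    and evaluates to the target, else 0.0. Weights and edge-case handling
    in GRPO-Zero itself may differ.
    """
    # Allow only digits, whitespace, arithmetic operators, and parentheses.
    if not re.fullmatch(r"[\d\s+\-*/().]+", equation):
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        value = eval(equation, {"__builtins__": {}}, {})  # arithmetic-only expression
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```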
3. Simple GRPO
An implementation from Simple GRPO, that uses DeepSpeed for training and a reference model server. It features a policy gradient loss with KL penalty and reward normalization within groups.
- Reference model server architecture
- GSM8K dataset
- KL divergence penalty term
- Per-token advantage calculation
- Distributed training support
- Loss Calculation: `loss = -(policy_ratio * advantage - beta * kl_divergence)`
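The same loss can be written per token roughly as follows; the k3-style KL estimator and the value of `beta` are assumptions for illustration and may differ from Simple GRPO's code.

```python
import torch

def grpo_token_loss(logps, old_logps, ref_logps, advantages, beta=0.04):
    """Per-token loss of the form -(policy_ratio * advantage - beta * kl_divergence).

    `logps`, `old_logps`, and `ref_logps` are per-token log-probs under the
    current, sampling-time, and reference policies; `advantages` is broadcast
    per token. beta=0.04 is an illustrative default, not necessarily the repo's.
    """
    policy_ratio = torch.exp(logps - old_logps)
    # k3 estimator of KL(pi_theta || pi_ref): non-negative and low-variance.
    kl_divergence = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1
    per_token_loss = -(policy_ratio * advantages - beta * kl_divergence)
    return per_token_loss.mean()
```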
An implementation from "The LM Book" by Andriy Burkov, that demonstrates the core GRPO algorithm step-by-step. It uses a copy of the reference model and performs multiple updates per batch.
- Periodic reference model updates
- Multiple updates per batch (μ-PPO)
- Comprehensive reward decomposition
- Memory optimization techniques
- Reward Function: Combined reward for correctness and format
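A rough sketch of such an outer loop, assuming hypothetical `generate_groups` and `grpo_update` helpers and placeholder values for `mu` and the reference-refresh interval:

```python
import copy

def train(policy, optimizer, batches, generate_groups, grpo_update,
          mu=3, ref_update_every=50):
    """Illustrative outer loop with periodic reference refresh and mu inner updates.

    `generate_groups` and `grpo_update` are hypothetical helpers; `mu` and
    `ref_update_every` are placeholder hyperparameters, not the book's values.
    """
    ref_model = copy.deepcopy(policy).eval()          # frozen copy as reference
    for step, batch in enumerate(batches):
        groups = generate_groups(policy, batch)       # sample completions once per batch
        for _ in range(mu):                           # multiple updates on the same batch
            grpo_update(policy, ref_model, optimizer, groups)
        if (step + 1) % ref_update_every == 0:
            ref_model.load_state_dict(policy.state_dict())  # periodic reference update
    return policy
```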
All implementations share the following steps:
- Group Sampling: For each prompt, multiple completions are generated to form a group.
- Reward Calculation: Each completion receives a scalar reward, typically combining correctness and format adherence.
- Advantage Normalization: Within each group, rewards are normalized to have zero mean and unit variance to form advantages.
- Policy Update: The policy is updated using a policy gradient method (with or without clipping) and often includes a KL penalty to prevent deviation from a reference policy.
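To make the clipping option concrete, here is a hedged sketch of a clipped PPO-style surrogate with an optional KL penalty toward the reference policy; the `eps` and `beta` values are illustrative only.

```python
import torch

def clipped_surrogate_loss(logps, old_logps, advantages, kl_to_ref=None,
                           eps=0.2, beta=0.0):
    """Clipped PPO-style surrogate used by some GRPO variants (illustrative).

    When beta > 0, a per-token KL penalty to the reference policy (`kl_to_ref`)
    is added; without clipping the first term reduces to ratio * advantage.
    """
    ratio = torch.exp(logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    loss = -torch.min(unclipped, clipped)
    if kl_to_ref is not None and beta > 0.0:
        loss = loss + beta * kl_to_ref
    return loss.mean()
```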
The implementations differ in the following aspects:
- Reward Functions: The implementations use different reward functions tailored to the task and different weights for format and correctness.
  - Format Reward: Enforces XML-style reasoning structure
  - Correctness Reward: Validates solution accuracy
  - Combined Reward: `R_total = R_format + R_correctness`
- Reference Model Handling: Some implementations use a fixed reference model (via a separate server or a frozen copy), while others update the reference model periodically.
- Training Framework: The implementations use different training frameworks (e.g., DeepSpeed, pure PyTorch) and optimization techniques (e.g., gradient checkpointing).
- Batching and Generation: The approaches to generation (vLLM, Hugging Face transformers) and batching vary.
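For example, group sampling with vLLM can request all completions of a group in a single call via `SamplingParams(n=...)`; the model name and decoding parameters below are placeholders, not the exact settings of any of the four implementations.

```python
from vllm import LLM, SamplingParams

# Illustrative settings; model name and decoding parameters are placeholders.
llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
sampling_params = SamplingParams(n=8, temperature=1.0, top_p=1.0, max_tokens=512)

prompts = ["Using the numbers [3, 5, 7], create an equation that equals 16."]
outputs = llm.generate(prompts, sampling_params)

for request_output in outputs:
    # Each prompt yields a group of n completions for GRPO's group statistics.
    completions = [c.text for c in request_output.outputs]
    print(len(completions), "completions for prompt:", request_output.prompt)
```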