Group Relative Policy Optimization (GRPO) is an algorithm proposed by DeepSeek for training large language models with reinforcement learning. This repository aggregates and refactors four distinct implementations of GRPO, each demonstrating different approaches to the core algorithm while sharing common principles.
The core GRPO algorithm follows these steps:
- For each training step, randomly sample $N$ questions $q_1, q_2, \cdots, q_N$.
- For each question $q_i$, sample $M$ answers $a_{i,1}, a_{i,2}, \cdots, a_{i,M}$.
- Compute the reward $r_{i,j}$ for each answer $a_{i,j}$.
- Compute group statistics for each question $q_i$: $$\mu_i = \frac{1}{M}\sum_{j=1}^{M} r_{i,j}, \qquad \sigma_i = \sqrt{\frac{1}{M}\sum_{j=1}^{M}\left(r_{i,j} - \mu_i\right)^2}$$
- For each token $t$ in answer $a_{i,j}$, compute the advantage: $$A_{i,j}[t] \leftarrow \frac{r_{i,j} - \mu_i}{\sigma_i}$$
- Update the policy using the PPO surrogate objective, whose per-token gradient (without clipping) is $$\nabla_\theta \log \pi_\theta(a_{i,j}[t]) \cdot A_{i,j}[t]$$
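To make these steps concrete, here is a minimal sketch of one GRPO step in plain PyTorch. The helpers `sample_answers`, `compute_reward`, and `token_log_probs` are hypothetical placeholders, not functions from any of the four implementations.

```python
import torch

def grpo_step(policy, optimizer, questions, M,
              sample_answers, compute_reward, token_log_probs, eps=1e-4):
    """One GRPO update over a batch of questions.

    `sample_answers(policy, q, M)` returns M sampled answers,
    `compute_reward(q, a)` returns a scalar reward, and
    `token_log_probs(policy, q, a)` returns per-token log-probs of answer a.
    All three are hypothetical helpers used only for this sketch.
    """
    losses = []
    for q in questions:
        answers = sample_answers(policy, q, M)              # group of M answers
        rewards = torch.tensor([compute_reward(q, a) for a in answers])
        mu, sigma = rewards.mean(), rewards.std()           # group statistics
        advantages = (rewards - mu) / (sigma + eps)         # normalized advantages
        for a, A in zip(answers, advantages):
            logp = token_log_probs(policy, q, a)            # shape: [num_tokens]
            losses.append(-(logp * A).mean())               # every token shares A
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```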
We provide four refactored implementations of GRPO, each with a different focus and design:
1. nanoAhaMoment

An implementation from nanoAhaMoment that separates each step of the GRPO loop into distinct components. It uses a rule-based reward function for a Countdown task and integrates with vLLM for efficient generation.
- Modular pipeline with separated components
- vLLM integration for efficient generation
- DeepSpeed training backend
- Format: `<think>...</think>\n<answer>...</answer>`
- Rule-based reward functions for Countdown tasks
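As an illustration of the format check, a rule-based format reward can be a simple regular expression over the `<think>`/`<answer>` tags; the pattern and reward value below are assumptions, not nanoAhaMoment's exact code.

```python
import re

FORMAT_PATTERN = re.compile(
    r"^<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the <think>/<answer> format, else 0.0.

    The pattern and reward value are illustrative; the repository's own
    rule-based reward for the Countdown task may use different weights.
    """
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0
```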
2. GRPO:Zero
An implementation from GRPO-Zero that follows a simplified training workflow. It trains Qwen2.5-3B-Instruct on the Countdown-Tasks-3to4 dataset and uses a combined reward for correctness and format.
- Qwen2.5-3B-Instruct base model
- Countdown-Tasks-3to4 dataset
- Simplified training workflow
- Reward Function: Combined reward for correctness and format
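For the Countdown task, a correctness reward typically checks that the proposed equation uses exactly the given numbers and evaluates to the target. The sketch below is an approximation under those assumptions, not GRPO-Zero's exact reward.

```python
import re

def countdown_correctness_reward(equation: str, numbers: list[int], target: int) -> float:
    """Illustrative correctness check for a Countdown answer.

    Returns 1.0 if the equation uses exactly the given numbers (each once)
    and evaluates to the target, else 0.0. Weights and edge-case handling
    in GRPO-Zero itself may differ.
    """
    # Allow only digits, whitespace, arithmetic operators, and parentheses.
    if not re.fullmatch(r"[\d\s+\-*/().]+", equation):
        return 0.0
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(numbers):
        return 0.0
    try:
        value = eval(equation, {"__builtins__": {}}, {})  # arithmetic-only expression
    except Exception:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0
```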
3. Simple GRPO
An implementation from Simple GRPO, that uses DeepSpeed for training and a reference model server. It features a policy gradient loss with KL penalty and reward normalization within groups.
- Reference model server architecture
- GSM8K dataset
- KL divergence penalty term
- Per-token advantage calculation
- Distributed training support
- Loss Calculation: `loss = -(policy_ratio * advantage - beta * kl_divergence)`
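The same loss can be written per token roughly as follows; the k3-style KL estimator and the value of `beta` are assumptions for illustration and may differ from Simple GRPO's code.

```python
import torch

def grpo_token_loss(logps, old_logps, ref_logps, advantages, beta=0.04):
    """Per-token loss of the form -(policy_ratio * advantage - beta * kl_divergence).

    `logps`, `old_logps`, and `ref_logps` are per-token log-probs under the
    current, sampling-time, and reference policies; `advantages` is broadcast
    per token. beta=0.04 is an illustrative default, not necessarily the repo's.
    """
    policy_ratio = torch.exp(logps - old_logps)
    # k3 estimator of KL(pi_theta || pi_ref): non-negative and low-variance.
    kl_divergence = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1
    per_token_loss = -(policy_ratio * advantages - beta * kl_divergence)
    return per_token_loss.mean()
```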
An implementation from "The LM Book" by Andriy Burkov, that demonstrates the core GRPO algorithm step-by-step. It uses a copy of the reference model and performs multiple updates per batch.
- Periodic reference model updates
- Multiple updates per batch (μ-PPO)
- Comprehensive reward decomposition
- Memory optimization techniques
- Reward Function: Combined reward for correctness and format
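A rough sketch of such an outer loop, assuming hypothetical `generate_groups` and `grpo_update` helpers and placeholder values for `mu` and the reference-refresh interval:

```python
import copy

def train(policy, optimizer, batches, generate_groups, grpo_update,
          mu=3, ref_update_every=50):
    """Illustrative outer loop with periodic reference refresh and mu inner updates.

    `generate_groups` and `grpo_update` are hypothetical helpers; `mu` and
    `ref_update_every` are placeholder hyperparameters, not the book's values.
    """
    ref_model = copy.deepcopy(policy).eval()          # frozen copy as reference
    for step, batch in enumerate(batches):
        groups = generate_groups(policy, batch)       # sample completions once per batch
        for _ in range(mu):                           # multiple updates on the same batch
            grpo_update(policy, ref_model, optimizer, groups)
        if (step + 1) % ref_update_every == 0:
            ref_model.load_state_dict(policy.state_dict())  # periodic reference update
    return policy
```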
All implementations share the following steps:
- Group Sampling: For each prompt, multiple completions are generated to form a group.
- Reward Calculation: Each completion receives a scalar reward, typically combining correctness and format adherence.
- Advantage Normalization: Within each group, rewards are normalized to have zero mean and unit variance to form advantages.
- Policy Update: The policy is updated using a policy gradient method (with or without clipping) and often includes a KL penalty to prevent deviation from a reference policy.
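To make the clipping option concrete, here is a hedged sketch of a clipped PPO-style surrogate with an optional KL penalty toward the reference policy; the `eps` and `beta` values are illustrative only.

```python
import torch

def clipped_surrogate_loss(logps, old_logps, advantages, kl_to_ref=None,
                           eps=0.2, beta=0.0):
    """Clipped PPO-style surrogate used by some GRPO variants (illustrative).

    When beta > 0, a per-token KL penalty to the reference policy (`kl_to_ref`)
    is added; without clipping the first term reduces to ratio * advantage.
    """
    ratio = torch.exp(logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    loss = -torch.min(unclipped, clipped)
    if kl_to_ref is not None and beta > 0.0:
        loss = loss + beta * kl_to_ref
    return loss.mean()
```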
The implementations differ in the following aspects:
- Reward Functions: The implementations use different reward functions tailored to the task and different weights for format and correctness.
  - Format Reward: Enforces XML-style reasoning structure
  - Correctness Reward: Validates solution accuracy
  - Combined Reward: `R_total = R_format + R_correctness`
- Reference Model Handling: Some implementations use a fixed reference model (via a separate server or a frozen copy), while others update the reference model periodically.
- Training Framework: The implementations use different training frameworks (e.g., DeepSpeed, pure PyTorch) and optimization techniques (e.g., gradient checkpointing).
- Batching and Generation: The approaches to generation (vLLM, Hugging Face transformers) and batching vary.
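For example, group sampling with vLLM can request all completions of a group in a single call via `SamplingParams(n=...)`; the model name and decoding parameters below are placeholders, not the exact settings of any of the four implementations.

```python
from vllm import LLM, SamplingParams

# Illustrative settings; model name and decoding parameters are placeholders.
llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")
sampling_params = SamplingParams(n=8, temperature=1.0, top_p=1.0, max_tokens=512)

prompts = ["Using the numbers [3, 5, 7], create an equation that equals 16."]
outputs = llm.generate(prompts, sampling_params)

for request_output in outputs:
    # Each prompt yields a group of n completions for GRPO's group statistics.
    completions = [c.text for c in request_output.outputs]
    print(len(completions), "completions for prompt:", request_output.prompt)
```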