A minimal, hackable implementation of Group Relative Policy Optimization (GRPO).
Goal: implement GRPO for training a local Llama-3.2-3B model with RL, with a focus on understanding the GRPO algorithm. Everything runs locally on a single RTX A6000 node with the Axolotl Docker image on RunPod.
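At its core, GRPO replaces a learned value baseline with group-relative advantages: sample several completions per prompt, score them, and normalize each reward against its own group. A minimal sketch of that idea (tensor shapes and the function name are illustrative, not the exact `train.py` code):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward
    against the mean/std of its own group (one group per prompt).

    rewards: (num_groups, group_size), one scalar reward per sampled completion.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each (1 = correct, 0 = wrong).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_advantages(rewards))
```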
This project is inspired by and builds upon open-thought/tiny-grpo.
[2025-07-19] (in-progress) Supporting GPG: Group Policy Gradient. Some configurations can be found in verl/gpg. The main tasks here are updating the reward modeling and advantage grouping, and removing the KL divergence term.
- Following GPG/open-r1/src/open_r1/gpg_trainer.py, we implemented `inverse_alpha`, avoiding division by zero when `n_valid_samples = 0` (a sketch follows this list).
- (in-progress) Resampling based on the ratio of valid samples.
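A sketch of the `inverse_alpha` rescaling, assuming a completion counts as "valid" when its group-relative advantage is nonzero (all-correct or all-wrong groups contribute no gradient); the guard returns a neutral scale instead of dividing by zero. This is an illustration, not the exact trainer code:

```python
import torch

def inverse_alpha(advantages: torch.Tensor) -> float:
    """Scale the GPG loss by n_total / n_valid (1 / alpha, alpha = n_valid / n_total)."""
    n_total = advantages.numel()
    n_valid_samples = int((advantages != 0).sum())
    if n_valid_samples == 0:   # guard: no informative samples in this batch
        return 1.0             # neutral scale instead of a divide-by-zero
    return n_total / n_valid_samples

adv = torch.tensor([0.5, -0.5, 0.0, 0.0])
print(inverse_alpha(adv))  # 4 / 2 = 2.0
```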
[2025-07-16] Upgrading transformers 4.48.1 -> 4.53.2. Starting from `transformers>=4.50`, the library modularized model support (see huggingface/transformers release v4.50.0). Switched to using `AutoModelForCausalLM` (a loading sketch follows the notes below).
- To load LLaMA models, you must explicitly install the llama extra. Perform a sanity check with:
python -c "from transformers.models.llama import LlamaForCausalLM; print(LlamaForCausalLM)"
- See lamng3/tiny-grpo (issue #1) for more details.
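A loading sketch after the upgrade; the checkpoint id and the dtype/attention settings are assumptions sized for a single RTX A6000, not pinned by the repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B-Instruct"  # illustrative checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # fits a 48 GB RTX A6000 comfortably
    attn_implementation="flash_attention_2",  # needs flash-attn (installed below)
    device_map="auto",
)
```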
[2025-07-15] Supporting DAPO, following huggingface/trl#3130 (comment); a sketch of the implemented pieces follows the list.
- Token-level loss is already implemented as `masked_mean`.
- Clip-Higher is implemented following huggingface/trl#3118 (comment). DAPO recommends using `clip_eps_low = 0.2` and `clip_eps_high = 0.28`.
- Dynamic Sampling (in progress) is skipped in huggingface/trl#3130 (comment) because of inefficiency. The configurations can be found in verl/dapo.
- Overlong Filtering is skipped (see the reason in huggingface/trl#3130 (comment) and verl/dapo).
- Soft Overlong Punishment is implemented following huggingface/trl#3130 (comment), with `L_cache = 256` for `L_max = 1024`, inspired by verl/dapo, where `L_cache` is `overlong_buffer`.
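A sketch of the two implemented pieces with the parameter values quoted above; function names are illustrative:

```python
import torch

def clip_higher_loss(ratio, advantages, clip_eps_low=0.2, clip_eps_high=0.28):
    """PPO surrogate with asymmetric clipping (DAPO's Clip-Higher).

    ratio: per-token pi_theta / pi_old; advantages broadcast to the same shape.
    A larger upper bound lets low-probability tokens grow, aiding exploration.
    """
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps_low, 1 + clip_eps_high) * advantages
    return -torch.minimum(unclipped, clipped)

def soft_overlong_punishment(length: int, L_max: int = 1024, L_cache: int = 256) -> float:
    """Length-shaped reward: 0 up to L_max - L_cache, linear down to -1 at L_max."""
    if length <= L_max - L_cache:
        return 0.0
    if length <= L_max:
        return ((L_max - L_cache) - length) / L_cache  # in (-1, 0]
    return -1.0
```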
[2025-07-12] Supporting Dr.GRPO, with two modifications following understand-r1-zero/train_zero_math.py: `masked_mean` normalizes by a constant generation max tokens (512, from oat/oat/args.py), and `group_advantage` drops the std bias (a sketch of both follows).
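A sketch of both modifications, assuming the same tensor shapes as the GRPO sketch above (an illustration, not the repo's exact code):

```python
import torch

MAX_GEN_TOKENS = 512  # constant normalizer (from oat/oat/args.py)

def group_advantage(rewards: torch.Tensor) -> torch.Tensor:
    """Dr.GRPO: center rewards within each group but skip the std division."""
    return rewards - rewards.mean(dim=-1, keepdim=True)

def masked_mean(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Normalize per-token losses by a fixed token budget instead of each
    sequence's own length, removing the length bias of per-sequence means."""
    return (values * mask).sum(dim=-1) / MAX_GEN_TOKENS
```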
- Spin up a RunPod instance: choose one RTX A6000 ($0.49/hr) with the Axolotl Docker image.
- Create conda env
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda create --name grpo python=3.12 -y
conda init
source ~/.bashrc # or ~/.zshrc if you're using zsh
conda activate grpo
- Install dependencies
cd tiny-grpo
pip install -r requirements.txt
pip install hf_transfer
pip install flash-attn --no-build-isolation
- HuggingFace and WandB login
huggingface-cli login
wandb login
- Play with the source in `train.py`
python train.py
- Pushing code to GitHub
# generate an SSH key (if you don't have one yet)
ssh-keygen -t ed25519 -C "your_email@example.com"
# add the public key to GitHub
cat ~/.ssh/id_ed25519.pub
# change the repo remote to SSH
git remote set-url origin git@github.com:<username>/<repo_name>.git
# test the connection
ssh -T git@github.com
- Transfer file from local computer to RunPod instance
scp -i ~/.ssh/id_ed25519 -P <port> <local_file_path> root@<host_name>:<remote_destination_path>
The run stopped at 9K steps due to insufficient disk space. To prevent this, consider doubling the storage capacity or offloading checkpoints to temporary storage; a pruning sketch is below.
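One way to keep the disk from filling is to prune old checkpoints as training proceeds. A sketch; the checkpoint layout (`step_*` subdirectories under `./output`) is an assumption, not the repo's actual paths:

```python
import shutil
from pathlib import Path

def prune_checkpoints(ckpt_dir: str, keep_last: int = 2) -> None:
    """Delete all but the most recent checkpoint directories."""
    ckpts = sorted(Path(ckpt_dir).glob("step_*"), key=lambda p: p.stat().st_mtime)
    for old in ckpts[:-keep_last]:
        shutil.rmtree(old)

prune_checkpoints("./output", keep_last=2)  # call after each checkpoint save
```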
- GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Understanding R1-Zero-Like Training: A Critical Perspective
- DeepSeek-R1 Tech Report
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models