# tiny-grpo

A minimal, hackable implementation of Group Relative Policy Optimization (GRPO).

Goal: implement GRPO for training a local llama-3.2-3b model with RL, with a focus on understanding the GRPO algorithm. Everything runs locally on a single RTX A6000 node using the Axolotl Docker image on RunPod.
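The core idea of GRPO can be sketched as follows: for each prompt, the policy samples a group of completions, and each completion's reward is turned into an advantage by normalizing against the group's own mean and standard deviation (no learned value function). This is an illustrative sketch only; the helper name and plain-Python types are not the repo's actual code:

```python
# Hypothetical sketch of GRPO's group-relative advantage: rewards for one
# prompt's sampled completions are normalized within the group, so no
# critic/value network is needed.

def group_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards to zero mean, unit std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. two correct and two incorrect completions for the same prompt
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions scoring above the group mean get positive advantages and are reinforced; those below the mean are pushed down.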

This project is inspired by and builds upon open-thought/tiny-grpo.

## Updates

[2025-07-19] (in progress) Supporting GPG (Group Policy Gradient). Some configurations can be found in verl/gpg. The main tasks here are updating reward modeling, grouping advantages, and removing the KL divergence term.
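Roughly, the GPG-style objective can be sketched as a plain policy-gradient term over group-relative advantages, with no KL penalty against a reference model. The function below is a hypothetical illustration, not the repo's or verl's actual API:

```python
import torch

# Sketch of a GPG-style loss (illustrative names): a plain policy-gradient
# term with per-sequence group-relative advantages and no KL penalty.

def gpg_loss(log_probs, advantages, mask):
    # log_probs:  (B, T) token log-probs under the current policy
    # advantages: (B,)   group-relative advantage per sequence
    # mask:       (B, T) 1 for generated tokens, 0 for padding
    per_token = -log_probs * advantages.unsqueeze(-1)
    return (per_token * mask).sum() / mask.sum()
```

Compared to GRPO, the only moving parts left are the reward model and the grouping of advantages, which matches the changes listed above.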

[2025-07-16] Upgraded transformers 4.48.1 -> 4.53.2. Starting from transformers>=4.50, the library modularized model support (see the huggingface/transformers v4.50.0 release notes). Switched to using AutoModelForCausalLM.

- To load LLaMA models, you must explicitly install the llama extra. Run a sanity check with `python -c "from transformers.models.llama import LlamaForCausalLM; print(LlamaForCausalLM)"`.

- See lamng3/tiny-grpo (issue #1) for more details.

[2025-07-15] Supporting DAPO, following huggingface/trl#3130 (comment).

[2025-07-12] Supporting Dr.GRPO, with two modifications: masked_mean with a constant generation max tokens (512, from oat/oat/args.py) and group_advantage without the std bias, following understand-r1-zero/train_zero_math.py.
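A minimal sketch of those two Dr.GRPO tweaks, assuming tensor-shaped inputs (the names below are illustrative, not the exact code from understand-r1-zero or oat):

```python
import torch

# Dr.GRPO sketch (illustrative names):
# 1) masked_mean divides by a constant generation length instead of the
#    per-sequence token count, removing the length bias.
# 2) group advantages are only mean-centered, not divided by the group std.

MAX_GEN_TOKENS = 512  # constant normalizer, as in oat/oat/args.py

def masked_mean_constant(values, mask, max_tokens=MAX_GEN_TOKENS):
    # values, mask: (B, T); normalize by the fixed length, not mask.sum()
    return (values * mask).sum(dim=-1) / max_tokens

def group_advantage_no_std(rewards):
    # center rewards within the group; skip the std division
    return rewards - rewards.mean()
```

Dividing by a constant rather than each sequence's own length keeps short and long responses on the same scale, and dropping the std division avoids up-weighting groups whose rewards happen to have low variance.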

## Setup

1. Spin up a RunPod instance: choose 1x RTX A6000 ($0.49/hr) with the Axolotl Docker image.

2. Create a conda env:

```shell
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda create --name grpo python=3.12 -y
source ~/.bashrc # or ~/.zshrc if you're using zsh
conda init
conda activate grpo
```

3. Install dependencies:

```shell
cd tiny-grpo
pip install -r requirements.txt
pip install hf_transfer
pip install flash-attn --no-build-isolation
```

4. Log in to HuggingFace and WandB:

```shell
huggingface-cli login
wandb login
```

5. Play with the source in train.py:

```shell
python train.py
```

6. Push code to GitHub over SSH:

```shell
# generate an SSH key (if not yet)
ssh-keygen -t ed25519 -C "your_email@example.com"

# add the public key to GitHub
cat ~/.ssh/id_ed25519.pub

# change the repo remote to SSH
git remote set-url origin git@github.com:<username>/<repo_name>.git

# test the connection
ssh -T git@github.com
```

7. Transfer a file from your local computer to the RunPod instance:

```shell
scp -i ~/.ssh/id_ed25519 -P <port> <local_file_path> root@<host_name>:<remote_destination_path>
```

## Training Results

The run stopped at 9K steps due to insufficient disk space. To prevent this, consider doubling the storage capacity or offloading checkpoints to temporary storage.

Training Returns

## References
