Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning

A continual post-training method for enhancing state-of-the-art (SOTA) LLMs for mathematical reasoning using Offline Group Relative Policy Optimization (GRPO).

Introduction

We propose a continual post-training method that can be applied to various reasoning-focused Large Language Models (LLMs) to further improve their performance. Our approach utilizes Offline Reinforcement Learning with Verifiable Rewards (Offline RLVR) to overcome the limitations of traditional on-policy methods. For detailed methodology and experimental results, please refer to our blog post:

  • Blog Post: Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning [KR / EN]
  • HuggingFace: Offline-GRPO Collections [Link]

Why Offline RLVR?

Traditional on-policy RLVR methods have two main limitations:

  1. Slow training speed: Requires generating rollouts for each problem at every training step
  2. Performance bottleneck: Limited by the base LLM's initial capability to generate correct trajectories

Our offline approach addresses these issues by leveraging pre-generated high-quality reasoning trajectories from teacher models, enabling faster training and better performance improvements.

Key Features:

  • Offline RL: Utilizes teacher model rollout trajectories for more efficient training
  • Enhanced GRPO: Addresses the challenge of all-positive reasoning traces by adding a bias term to the group-relative advantage (see the sketch after this list)
  • Improved Performance: Demonstrates superior results compared to standard SFT and on-policy methods
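
The sketch below illustrates the all-positive issue and the bias-term remedy in isolation. It is a simplified illustration of the idea, not the repository's implementation (see src/verl/verl/mix_src for that), and the bias value shown is a hypothetical hyperparameter.

import numpy as np

def grpo_advantages(rewards: np.ndarray, bias: float = 0.0, eps: float = 1e-6) -> np.ndarray:
    # Group-relative advantages for one prompt's group of trajectories.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # In standard GRPO an all-positive (all-correct) group is centered to ~0,
    # so perfect teacher traces contribute no learning signal; a positive bias
    # keeps such groups reinforced.
    return adv + bias

group = np.array([1.0, 1.0, 1.0, 1.0])   # every teacher trace verified correct
print(grpo_advantages(group))            # ~[0, 0, 0, 0]: no gradient signal
print(grpo_advantages(group, bias=0.5))  # [0.5, 0.5, 0.5, 0.5]: traces still reinforced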

Installation

You can install dependencies by running the following commands:

conda create -n offline-grpo python=3.10
conda activate offline-grpo
cd src
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .

If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases. For example, we use the following version:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
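
After installation, a quick import check (a minimal sanity check, assuming the wheel above) confirms that flash-attn loads in the new environment:

# Run inside the offline-grpo environment.
import flash_attn
print(flash_attn.__version__)  # expect 2.7.3 for the wheel above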

Repo Structure

This repository includes:

  • src: Code for training with off-policy reasoning traces. Our main code changes are in src/verl/verl/mix_src.
  • data: Data and code for training and evaluating our method.
  • exp_scripts: Example script to train with offline GRPO.
  • eval_scripts: Evaluation scripts on math and out-of-distribution benchmarks. We use SkyThought for evaluation here.

Our implementation is built on top of the verl framework and supports plug-and-play integration with off-policy traces from models such as DeepSeek-R1.


Usage

Data Download

We provide pre-processed datasets for training:

  • Download the OpenThought3 filtered math dataset: Google Drive
  • The dataset contains positive and negative reasoning trajectories from teacher models (a rough inspection sketch follows this list)
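
As a rough illustration, the downloaded data can be inspected along the following lines. The file name and column names are assumptions made for this sketch; check the files from the Google Drive link for the actual schema.

import pandas as pd

# Hypothetical file name; use the actual file(s) downloaded from Google Drive.
df = pd.read_parquet("openthought3_math_filtered.parquet")

print(len(df), "examples")
print(df.columns.tolist())  # expect prompt / teacher-trajectory / reward-style fields
print(df.iloc[0])           # one record: a math problem plus teacher reasoning traces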

Training

We provide an example script for training with offline GRPO on the prepared data:

cd exp_scripts
bash train_openthinker3.sh

This script launches multi-GPU training for the target model using our offline GRPO method.
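
For reference, a verl-style launch generally reduces to a Hydra-configured entrypoint plus a handful of overrides, roughly as sketched below. The entrypoint module and the specific overrides used by train_openthinker3.sh are assumptions and may differ from the actual script; treat this as a sketch of the general shape only.

# Illustrative sketch of a verl-style training launch (not the contents of
# train_openthinker3.sh). Paths and override values are placeholders.
import subprocess

overrides = [
    "algorithm.adv_estimator=grpo",                      # group-relative advantages
    "data.train_files=/path/to/offline_teacher_traces.parquet",
    "actor_rollout_ref.model.path=/path/to/target_model",
    "trainer.n_gpus_per_node=8",
    "trainer.nnodes=1",
]
subprocess.run(["python3", "-m", "verl.trainer.main_ppo", *overrides], check=True)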


Credits

This project heavily builds upon the excellent LUFFY repository. We extend our sincere gratitude to the original LUFFY authors for providing:

  • Core Framework: The foundational reinforcement learning framework
  • GRPO Implementation: The Group Relative Policy Optimization algorithm implementation
  • Training Infrastructure: Multi-GPU training setup and data processing pipelines

Our contributions include modifications that train purely on off-policy reasoning traces and enhanced data processing for continual post-training scenarios. The majority of the codebase, especially the core training loop and infrastructure, comes from the original LUFFY implementation.

We also acknowledge the use of:

  • veRL for reinforcement learning infrastructure
  • deepscaler for scaling utilities
  • vLLM for efficient inference
  • Math-Verify for math reasoning evaluation

Citation

@inproceedings{krafton2025encontinualposttrainingof,
  author = {KRAFTON and SKT},
  title = {[EN] Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning},
  abstract = {In this post, we explore a new approach to enhancing the reasoning capabilities of LLMs through continual post-training. While pre-training equips LLMs with broad linguistic knowledge, it often falls short in complex reasoning tasks like math or code. Recent models have shown that Reinforcement Learning with Verifiable Rewards (RLVR) can help bridge this gap, but existing methods rely on slow and limited online training. We propose an offline alternative using teacher-generated trajectories and introduce a novel variant of Group Relative Policy Optimization (GRPO) that better captures high-quality reasoning traces—even when all outputs are positive. Our experiments on mathematical reasoning show that this method leads to consistent improvements.},
  year = {2025},
  date = {July 28, 2025},
  url  = {https://krafton-ai.github.io/blog/llm_post_training_en/}
}
