A continual post-training method for enhancing state-of-the-art (SOTA) LLMs for mathematical reasoning using Offline Group Relative Policy Optimization (GRPO).
We propose a continual post-training method that can be applied to various reasoning-focused Large Language Models (LLMs) to further improve their performance. Our approach utilizes Offline Reinforcement Learning with Verifiable Rewards (Offline RLVR) to overcome the limitations of traditional on-policy methods. For detailed methodology and experimental results, please refer to our blog post:
- Blog Post: Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning [KR / EN]
- HuggingFace: Offline-GRPO Collections [Link]
Traditional on-policy RLVR methods have two main limitations:
- Slow training speed: Requires generating rollouts for each problem at every training step
- Performance bottleneck: Limited by the base LLM's initial capability to generate correct trajectories
Our offline approach addresses these issues by leveraging pre-generated, high-quality reasoning trajectories from teacher models, enabling faster training and larger performance gains.
- Offline RL: Utilizes teacher model rollout trajectories for more efficient training
- Enhanced GRPO: Addresses the challenge of all-positive reasoning traces by adding a bias term (see the sketch after this list)
- Improved Performance: Demonstrates superior results compared to standard SFT and on-policy methods
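As a rough intuition for why all-positive groups are a problem: standard GRPO normalizes rewards within each rollout group, so a group in which every teacher trajectory is verified correct collapses to zero advantage and contributes no learning signal. The snippet below is a minimal sketch of one way an additive bias restores that signal; the placement and value of the bias here are illustrative assumptions, not necessarily the exact formulation in our released code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, positive_bias: float = 0.1, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for one prompt's rollout group.

    rewards: shape (group_size,), verifiable 0/1 correctness rewards.
    Vanilla GRPO subtracts the group mean and divides by the group std, so a
    group whose trajectories are all correct collapses to zero advantage.
    Adding a small positive bias keeps a learning signal on such groups.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
    if torch.all(rewards > 0):  # all teacher traces verified correct
        advantages = advantages + positive_bias
    return advantages

# All-correct group of 4 teacher traces: vanilla GRPO yields all-zero
# advantages, while the biased version keeps a uniform positive signal.
print(grpo_advantages(torch.tensor([1.0, 1.0, 1.0, 1.0])))
```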
You can install dependencies by running the following commands:
```bash
conda create -n offline-grpo python=3.10
conda activate offline-grpo
cd src
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
```
If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use the following version:
```bash
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.3+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
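After installing the wheel, a quick sanity check (a minimal snippet, assuming you are inside the offline-grpo environment) confirms that the installed build matches the PyTorch/CUDA versions encoded in the wheel name:

```python
# Sanity check that the prebuilt flash-attn wheel matches this environment.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)  # wheel above targets torch 2.4 / CUDA 12
print("flash-attn:", flash_attn.__version__)                       # expect 2.7.3
```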
This repository includes:
- `src`: Code for training with off-policy reasoning traces. Our main code changes are in `src/verl/verl/mix_src`.
- `data`: Data and code for training and evaluating our method.
- `exp_scripts`: Example script to train with offline GRPO.
- `eval_scripts`: Evaluation scripts for math and out-of-distribution benchmarks. We use SkyThought for evaluation here.
Our implementation is built on top of the verl framework and supports plug-and-play integration with off-policy traces from models such as DeepSeek-R1.
We provide pre-processed datasets for training:
- Download the OpenThought3 filtered math dataset: Google Drive
- The dataset contains positive and negative reasoning trajectories from teacher models
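If you want to inspect the data before training, something like the following works; the parquet format is assumed from verl's usual data pipeline, and the file name below is purely illustrative, so check the actual contents after downloading.

```python
# Hypothetical peek at the downloaded training data (file name is illustrative).
import pandas as pd

df = pd.read_parquet("data/openthoughts3_math_filtered.parquet")
print(len(df), "examples")
print(df.columns.tolist())  # e.g. prompt / teacher-trajectory / reward fields
print(df.iloc[0])           # inspect one example end to end
```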
We provide an example script for training with offline GRPO on the prepared data:
```bash
cd exp_scripts
bash train_openthinker3.sh
```
This script launches multi-GPU training for the target model using our offline GRPO method.
This project heavily builds upon the excellent LUFFY repository. We extend our sincere gratitude to the original LUFFY authors for providing:
- Core Framework: The foundational reinforcement learning framework
- GRPO Implementation: The Group Relative Policy Optimization algorithm implementation
- Training Infrastructure: Multi-GPU training setup and data processing pipelines
Our contributions include modifications to handle only off-policy reasoning traces and enhanced data processing for continual post-training scenarios. The majority of the codebase, especially the core training loop and infrastructure, comes from the original LUFFY implementation.
We also acknowledge the use of:
- veRL for reinforcement learning infrastructure
- deepscaler for scaling utilities
- vLLM for efficient inference
- Math-Verify for math reasoning evaluation
@inproceedings{krafton2025encontinualposttrainingof,
author = {KRAFTON, and SKT, },
title = {[EN] Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning},
abstract = {In this post, we explore a new approach to enhancing the reasoning capabilities of LLMs through continual post-training. While pre-training equips LLMs with broad linguistic knowledge, it often falls short in complex reasoning tasks like math or code. Recent models have shown that Reinforcement Learning with Verifiable Rewards (RLVR) can help bridge this gap, but existing methods rely on slow and limited online training. We propose an offline alternative using teacher-generated trajectories and introduce a novel variant of Group Relative Policy Optimization (GRPO) that better captures high-quality reasoning traces—even when all outputs are positive. Our experiments on mathematical reasoning show that this method leads to consistent improvements.},
year = {2025},
date = {July 28, 2025},
url = {https://krafton-ai.github.io/blog/llm_post_training_en/}
}