Pretrained checkpoints are available on Hugging Face:
- Merged 16-bit (drop-in Transformers): srikar-v05/Qwen2.5-3B-GRPO-16bit
- LoRA adapters (PEFT/composable): srikar-v05/Qwen2.5-3B-GRPO-LoRA
Both variants were trained with GRPO on GSM8K using an XML reasoning schema, in the spirit of DeepSeek’s GRPO-based reasoning training.
A lightweight, reproducible pipeline that fine-tunes Qwen2.5-3B-Instruct with LoRA and trains it with GRPO (Group Relative Policy Optimization) on GSM8K, enforcing an XML reasoning schema for easy evaluation—conceptually similar to DeepSeek’s GRPO-based reasoning training.
- Base model: Qwen/Qwen2.5-3B-Instruct
- Method: 4-bit QLoRA + GRPO (online RL with multi-objective rewards)
- Runtime: Unsloth for efficient LoRA training; vLLM for fast candidate generation during RL
- Task/data: GSM8K (grade-school math)
- Output schema: XML tags `<reasoning>` and `<answer>` to make grading trivial
- Artifacts: save the LoRA, merge to 16-bit, and push to the Hub
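For concreteness, here is a minimal sketch of the schema as a system prompt plus a parsing helper; the exact prompt wording and identifiers in the notebook may differ.

```python
import re

# Illustrative system prompt enforcing the XML reasoning schema.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_xml_answer(text: str) -> str:
    """Return the contents of the <answer> block, or an empty string if absent."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""
```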
Why “similar to DeepSeek”? We adopt GRPO (critic-free, group-baseline PPO variant) to improve reasoning, the same RL algorithm family DeepSeek introduced in DeepSeekMath and later used in DeepSeek-R1’s reasoning-focused training.
- GRPO for reasoning. DeepSeek introduced GRPO, which replaces the value function with a group baseline over multiple samples from the same prompt—cutting memory and complexity while keeping PPO-style stability. This was shown to boost math reasoning and later underpinned R1’s reasoning training.
- DeepSeek-R1 pipeline. R1-Zero was trained purely with RL (no SFT); R1 added multi-stage training and RL to improve readability and performance—establishing a strong precedent for RL-driven reasoning. Our project is inspired by this approach (not an exact replica). (arXiv)
- Why LoRA / QLoRA. LoRA adapts only small rank-decomposed matrices—massively reducing trainable params. QLoRA enables 4-bit fine-tuning while preserving 16-bit quality, making the project feasible on modest GPUs. (arXiv)
- Why Unsloth & vLLM. Unsloth speeds up LoRA/QLoRA fine-tuning; vLLM provides high-throughput generation that plugs into TRL’s online RL (GRPO) loop. (GitHub, VLLM Documentation)
- Loads Qwen2.5-3B-Instruct (3B) and wraps it with LoRA adapters for the attention & MLP projections (sketched below). (Hugging Face)
- Prepares GSM8K with a system prompt that enforces the XML schema. (arXiv)
- Defines five reward functions:
  - XML structure (strict/soft), XML token-shape (count), integer-answer heuristic, and exact-match correctness.
- Runs TRL's `GRPOTrainer` with vLLM to sample multiple generations per prompt (the group) and trains the policy using group-relative advantages plus KL regularization to a reference policy, a GRPO hallmark. (Hugging Face)
- Saves the LoRA, evaluates before/after, merges to 16-bit, and optionally pushes to the Hub for easy reuse.
The overall recipe mirrors the GRPO-driven reasoning emphasis in DeepSeekMath and R1: online RL, multi-candidate sampling, group-relative baselining, and KL regularization to a reference policy.
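As a sketch of the model-loading step above, assuming Unsloth's `FastLanguageModel` API; the rank, sequence length, and seed below are illustrative, not the notebook's exact values.

```python
from unsloth import FastLanguageModel

# Load the 3B base in 4-bit and attach LoRA to attention + MLP projections.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,          # QLoRA-style 4-bit base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```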
- Unsloth (fast LoRA/QLoRA fine-tuning), vLLM (fast generation), TRL (GRPO trainer), Transformers, PEFT, bitsandbytes. (GitHub, VLLM Documentation, Hugging Face)
- Quantized LoRA (QLoRA) relies on 4-bit NF4 / double quantization with bitsandbytes (sketched below). (arXiv, Hugging Face)
The notebook includes a Colab-aware installer and GPU checks.
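For readers using plain Transformers rather than Unsloth, the same 4-bit NF4 + double-quantization setup looks roughly like this (a sketch, not the notebook's code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with double quantization (the QLoRA recipe) via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```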
- GSM8K: 8.5k grade-school math word problems; we parse the gold label after `####` and wrap prompts in a system schema enforcing `<reasoning>`/`<answer>` (mapping sketched below). (arXiv, Hugging Face)
- Why XML schema?
  - Easy to parse & grade.
  - Lets us separate CoT (reasoning) from the final answer for targeted rewards.
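A minimal sketch of the data preparation, assuming the `openai/gsm8k` dataset ID and reusing the `SYSTEM_PROMPT` sketched earlier:

```python
from datasets import load_dataset

def extract_hash_answer(text: str):
    """GSM8K gold answers end with '#### <number>'; return the part after '####'."""
    return text.split("####")[-1].strip() if "####" in text else None

def to_grpo_example(example):
    # Chat-style prompt with the XML-schema system message; `answer` keeps the gold label.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": extract_hash_answer(example["answer"]),
    }

train_dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_grpo_example)
```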
- XML structure (strict / soft) — pushes completions to respect the schema.
- XML token-shape counter — stabilizes early training by rewarding partial structural compliance.
- Integer-answer check — cheap sanity for numeric tasks.
- Exact-match correctness — aligns the `<answer>` text to the gold label.
Combining structure + outcome signals is a lightweight proxy for outcome/process rewards that DeepSeek explored at scale; we keep it simple but schema-aligned.
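Two of the five rewards, as illustrative sketches assuming TRL's reward-function convention (completions arrive as chat messages, and extra dataset columns such as `answer` are forwarded as keyword arguments); the reward magnitudes are placeholders, and `extract_xml_answer` is the helper sketched earlier.

```python
import re

def correctness_reward(prompts, completions, answer, **kwargs):
    """Exact match between the model's <answer> block and the GSM8K gold label."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == gold else 0.0 for e, gold in zip(extracted, answer)]

def strict_format_reward(completions, **kwargs):
    """Reward completions that follow the full <reasoning>/<answer> schema exactly."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s*$"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.match(pattern, r, flags=re.DOTALL) else 0.0 for r in responses]
```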
- Online RL: generate multiple candidates per prompt with vLLM, compute group-relative advantages, and update the policy while regularizing to a reference policy (KL). (Hugging Face)
- No critic/value model: GRPO uses the group mean as a baseline, reducing memory vs. PPO—exactly the DeepSeek idea.
- Config highlights: `num_generations` (group size), `max_completion_length`, conservative LR, cosine schedule, gradient clipping, checkpointing (example configuration below).
For context on GRPO’s design and usage within TRL and vLLM, see the TRL GRPO docs and vLLM-TRL integration notes. (Hugging Face, VLLM Documentation)
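An illustrative configuration continuing from the earlier sketches, using TRL's `GRPOConfig`/`GRPOTrainer` names but placeholder values rather than the notebook's exact settings:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,                 # conservative LR
    lr_scheduler_type="cosine",
    max_grad_norm=0.1,                  # gradient clipping
    num_generations=8,                  # group size per prompt
    max_prompt_length=256,
    max_completion_length=512,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    save_steps=250,                     # checkpointing
    use_vllm=True,                      # fast candidate generation during RL
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    reward_funcs=[strict_format_reward, correctness_reward],
)
trainer.train()
```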
- Before/after: query the model ("How many r's in strawberry?") pre-adapter vs. with the saved LoRA to verify the adapter actually moves behavior.
- Save & share (sketched below):
  - Save the LoRA (adapter-only) for PEFT workflows. (Hugging Face)
  - Merge to 16-bit and push a single checkpoint for drop-in Transformers use.
- Why both? Adapters enable composition; merged weights simplify downstream deployment.
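Roughly, the save/merge/push step looks like this, assuming Unsloth's merge helpers (`save_pretrained_merged` / `push_to_hub_merged`); with plain PEFT you would use `merge_and_unload()` and `push_to_hub()` instead.

```python
# Adapter-only save (standard PEFT layout), composable with other adapters.
model.save_pretrained("grpo_saved_lora")
tokenizer.save_pretrained("grpo_saved_lora")

# Merge the LoRA into the base weights and push a single 16-bit checkpoint
# for drop-in Transformers use. These helpers are Unsloth-specific.
model.save_pretrained_merged("Qwen2.5-3B-GRPO-16bit", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("srikar-v05/Qwen2.5-3B-GRPO-16bit", tokenizer, save_method="merged_16bit")
```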
- Install (Colab/local): run the first cell to set up Unsloth, vLLM, TRL, etc. (GitHub, VLLM Documentation)
- Load model: Qwen2.5-3B-Instruct with 4-bit loading + LoRA targets (Q/K/V/O, gate/up/down). (Hugging Face)
- Load GSM8K and map prompts to the XML schema with gold answers. (Hugging Face)
- Define rewards as in the code.
- Configure GRPO (see the `GRPOConfig` cell) and train. (Hugging Face)
- Run sanity inference, save the LoRA, re-load the LoRA for A/B comparison, then merge and push.
- Same RL family: We use GRPO to improve reasoning—the algorithm introduced by DeepSeek for math reasoning and carried into R1’s reasoning pipeline.
- Online, multi-sample training: Generate groups of outputs per prompt and compute relative advantages—a core GRPO idea.
- Reasoning-first objective: While our rewards are simpler (structure + correctness), the spirit matches R1’s focus on reinforcement-driven reasoning quality. (arXiv)
Important: We do not claim parity with DeepSeek’s scale, datasets, or exact multi-stage recipes; this is a lightweight, reproducible adaptation of the same RL principle on a 3B model.
- Exact-match on `<answer>` for GSM8K; optionally add stop sequences after `</answer>` (metric sketch after this list).
- Track: format adherence, accuracy, and length of `<reasoning>`.
- For deeper rigor, compare with a pure SFT baseline and an offline RFT/DPO variant.
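A minimal metric sketch along these lines, reusing the hypothetical `extract_xml_answer` helper from the earlier sketch:

```python
import re

def gsm8k_metrics(predictions, gold_answers):
    """Exact-match accuracy on <answer>, schema adherence, and mean <reasoning> length."""
    extracted = [extract_xml_answer(p) for p in predictions]
    accuracy = sum(e == g for e, g in zip(extracted, gold_answers)) / max(len(gold_answers), 1)

    schema = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
    adherence = sum(bool(schema.search(p)) for p in predictions) / max(len(predictions), 1)

    reasoning_lengths = [
        len(m.group(1).split()) if (m := re.search(r"<reasoning>(.*?)</reasoning>", p, re.DOTALL)) else 0
        for p in predictions
    ]
    return {
        "accuracy": accuracy,
        "format_adherence": adherence,
        "mean_reasoning_words": sum(reasoning_lengths) / max(len(reasoning_lengths), 1),
    }
```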
- Small scale: 3B with GSM8K is great for demos but not state-of-the-art.
- Rewards are brittle: Exact-match favors formatting; consider numeric tolerance and unit handling.
- Schema lock-in: Over-constraining output can hurt fluency; keep a balance.
- Safety: Do not deploy for high-stakes decisions without thorough red-teaming and guardrails.
- DeepSeek / GRPO / R1
  - DeepSeekMath: introduces GRPO (critic-free PPO variant) and shows reasoning gains.
  - DeepSeek-R1: RL-centric reasoning training; R1-Zero trained without SFT, R1 uses multi-stage training + RL. (arXiv)
  - Plain-language overviews of GRPO (HF course; community explainers). (Hugging Face, Oxen.ai)
- GRPO in TRL & vLLM
  - TRL GRPOTrainer docs; vLLM for fast online sampling in RL. (Hugging Face, VLLM Documentation)
- Models & Libraries
  - Qwen2.5-3B-Instruct model card / blog. (Hugging Face, Qwen)
  - Unsloth (repo/wiki; 2× inference notes). (GitHub)
  - PEFT docs and library. (Hugging Face, GitHub)
  - vLLM docs. (VLLM Documentation)
- Data / Quantization
  - GSM8K paper + dataset. (arXiv, Hugging Face)
  - LoRA paper; QLoRA paper/blog. (arXiv, Hugging Face)
- Thanks to the DeepSeek team for releasing work that popularized GRPO for reasoning.
- Thanks to Qwen, Hugging Face (Transformers, TRL, PEFT), Unsloth, vLLM maintainers, and the GSM8K authors for open resources. (Hugging Face, GitHub, VLLM Documentation, arXiv)