Pretrained checkpoints are available on Hugging Face:
- Merged 16-bit (drop-in Transformers): srikar-v05/Qwen2.5-3B-GRPO-16bit
- LoRA adapters (PEFT/composable): srikar-v05/Qwen2.5-3B-GRPO-LoRA
Both variants were trained with GRPO on GSM8K using an XML reasoning schema, in the spirit of DeepSeek’s GRPO-based reasoning training.
A lightweight, reproducible pipeline that fine-tunes Qwen2.5-3B-Instruct with LoRA and trains it with GRPO (Group Relative Policy Optimization) on GSM8K, enforcing an XML reasoning schema for easy evaluation—conceptually similar to DeepSeek’s GRPO-based reasoning training.
- Base model: Qwen/Qwen2.5-3B-Instruct
- Method: 4-bit QLoRA + GRPO (online RL with multi-objective rewards)
- Runtime: Unsloth for efficient LoRA training; vLLM for fast candidate generation during RL
- Task/data: GSM8K (grade-school math)
- Output schema: XML tags `<reasoning>` and `<answer>` to make grading trivial
- Artifacts: save the LoRA, merge to 16-bit, and push to the Hub
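For concreteness, here is a minimal sketch of the schema as a system prompt plus a parsing helper; the exact prompt wording and identifiers in the notebook may differ.

```python
import re

# Illustrative system prompt enforcing the XML reasoning schema.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_xml_answer(text: str) -> str:
    """Return the contents of the <answer> block, or an empty string if absent."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""
```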
Why “similar to DeepSeek”? We adopt GRPO (critic-free, group-baseline PPO variant) to improve reasoning, the same RL algorithm family DeepSeek introduced in DeepSeekMath and later used in DeepSeek-R1’s reasoning-focused training.
- GRPO for reasoning. DeepSeek introduced GRPO, which replaces the value function with a group baseline over multiple samples from the same prompt—cutting memory and complexity while keeping PPO-style stability. This was shown to boost math reasoning and later underpinned R1’s reasoning training.
- DeepSeek-R1 pipeline. R1-Zero was trained purely with RL (no SFT); R1 added multi-stage training and RL to improve readability and performance—establishing a strong precedent for RL-driven reasoning. Our project is inspired by this approach (not an exact replica). (arXiv)
- Why LoRA / QLoRA. LoRA adapts only small rank-decomposed matrices—massively reducing trainable params. QLoRA enables 4-bit fine-tuning while preserving 16-bit quality, making the project feasible on modest GPUs. (arXiv)
- Why Unsloth & vLLM. Unsloth speeds up LoRA/QLoRA fine-tuning; vLLM provides high-throughput generation that plugs into TRL’s online RL (GRPO) loop. (GitHub, VLLM Documentation)
- Loads Qwen2.5-3B-Instruct (3B) and wraps it with LoRA adapters for the attention & MLP projections (sketched below). (Hugging Face)
- Prepares GSM8K with a system prompt that enforces the XML schema. (arXiv)
- Defines five reward functions:
  - XML structure (strict/soft), XML token-shape (count), integer-answer heuristic, and exact-match correctness.
- Runs TRL's `GRPOTrainer` with vLLM to sample multiple generations per prompt (the group) and trains the policy using group-relative advantages plus KL regularization to a reference policy, a GRPO hallmark. (Hugging Face)
- Saves the LoRA, evaluates before/after, merges to 16-bit, and optionally pushes to the Hub for easy reuse.
The overall recipe mirrors the GRPO-driven reasoning emphasis in DeepSeekMath and R1: online RL, multi-candidate sampling, group-relative baselining, and KL regularization to a reference policy.
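As a sketch of the model-loading step above, assuming Unsloth's `FastLanguageModel` API; the rank, sequence length, and seed below are illustrative, not the notebook's exact values.

```python
from unsloth import FastLanguageModel

# Load the 3B base in 4-bit and attach LoRA to attention + MLP projections.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,          # QLoRA-style 4-bit base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # LoRA rank (illustrative)
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```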
- Unsloth (fast LoRA/QLoRA fine-tuning), vLLM (fast generation), TRL (GRPO trainer), Transformers, PEFT, bitsandbytes. (GitHub, VLLM Documentation, Hugging Face)
- Quantized LoRA (QLoRA) relies on 4-bit NF4 / double quantization with bitsandbytes (sketched below). (arXiv, Hugging Face)
The notebook includes a Colab-aware installer and GPU checks.
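For readers using plain Transformers rather than Unsloth, the same 4-bit NF4 + double-quantization setup looks roughly like this (a sketch, not the notebook's code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with double quantization (the QLoRA recipe) via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```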
- GSM8K: 8.5k grade-school math word problems; we parse the gold label after `####` and wrap prompts in a system schema enforcing `<reasoning>`/`<answer>` (mapping sketched below). (arXiv, Hugging Face)
- Why XML schema?
  - Easy to parse & grade.
  - Lets us separate CoT (reasoning) from the final answer for targeted rewards.
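A minimal sketch of the data preparation, assuming the `openai/gsm8k` dataset ID and reusing the `SYSTEM_PROMPT` sketched earlier:

```python
from datasets import load_dataset

def extract_hash_answer(text: str):
    """GSM8K gold answers end with '#### <number>'; return the part after '####'."""
    return text.split("####")[-1].strip() if "####" in text else None

def to_grpo_example(example):
    # Chat-style prompt with the XML-schema system message; `answer` keeps the gold label.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": extract_hash_answer(example["answer"]),
    }

train_dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_grpo_example)
```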
- XML structure (strict / soft) — pushes completions to respect the schema.
- XML token-shape counter — stabilizes early training by rewarding partial structural compliance.
- Integer-answer check — cheap sanity for numeric tasks.
- Exact-match correctness — aligns the `<answer>` text to the gold label.
Combining structure + outcome signals is a lightweight proxy for outcome/process rewards that DeepSeek explored at scale; we keep it simple but schema-aligned.
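Two of the five rewards, as illustrative sketches assuming TRL's reward-function convention (completions arrive as chat messages, and extra dataset columns such as `answer` are forwarded as keyword arguments); the reward magnitudes are placeholders, and `extract_xml_answer` is the helper sketched earlier.

```python
import re

def correctness_reward(prompts, completions, answer, **kwargs):
    """Exact match between the model's <answer> block and the GSM8K gold label."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == gold else 0.0 for e, gold in zip(extracted, answer)]

def strict_format_reward(completions, **kwargs):
    """Reward completions that follow the full <reasoning>/<answer> schema exactly."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\s*$"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.match(pattern, r, flags=re.DOTALL) else 0.0 for r in responses]
```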
- Online RL: generate multiple candidates per prompt with vLLM, compute group-relative advantages, and update the policy while regularizing to a reference policy (KL). (Hugging Face)
- No critic/value model: GRPO uses the group mean as a baseline, reducing memory vs. PPO—exactly the DeepSeek idea.
- Config highlights: `num_generations` (group size), `max_completion_length`, conservative LR, cosine schedule, gradient clipping, checkpointing (example configuration below).
For context on GRPO’s design and usage within TRL and vLLM, see the TRL GRPO docs and vLLM-TRL integration notes. (Hugging Face, VLLM Documentation)
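An illustrative configuration continuing from the earlier sketches, using TRL's `GRPOConfig`/`GRPOTrainer` names but placeholder values rather than the notebook's exact settings:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,                 # conservative LR
    lr_scheduler_type="cosine",
    max_grad_norm=0.1,                  # gradient clipping
    num_generations=8,                  # group size per prompt
    max_prompt_length=256,
    max_completion_length=512,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    save_steps=250,                     # checkpointing
    use_vllm=True,                      # fast candidate generation during RL
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    reward_funcs=[strict_format_reward, correctness_reward],
)
trainer.train()
```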
- Before/after: query the model ("How many r's in strawberry?") pre-adapter vs. with the saved LoRA to verify the adapter actually moves behavior.
- Save & share (sketched below):
  - Save the LoRA (adapter-only) for PEFT workflows. (Hugging Face)
  - Merge to 16-bit and push a single checkpoint for drop-in Transformers use.
- Why both? Adapters enable composition; merged weights simplify downstream deployment.
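Roughly, the save/merge/push step looks like this, assuming Unsloth's merge helpers (`save_pretrained_merged` / `push_to_hub_merged`); with plain PEFT you would use `merge_and_unload()` and `push_to_hub()` instead.

```python
# Adapter-only save (standard PEFT layout), composable with other adapters.
model.save_pretrained("grpo_saved_lora")
tokenizer.save_pretrained("grpo_saved_lora")

# Merge the LoRA into the base weights and push a single 16-bit checkpoint
# for drop-in Transformers use. These helpers are Unsloth-specific.
model.save_pretrained_merged("Qwen2.5-3B-GRPO-16bit", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("srikar-v05/Qwen2.5-3B-GRPO-16bit", tokenizer, save_method="merged_16bit")
```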
- Install (Colab/local): run the first cell to set up Unsloth, vLLM, TRL, etc. (GitHub, VLLM Documentation)
- Load model: Qwen2.5-3B-Instruct with 4-bit loading + LoRA targets (Q/K/V/O, gate/up/down). (Hugging Face)
- Load GSM8K and map prompts to the XML schema with gold answers. (Hugging Face)
- Define rewards as in the code.
- Configure GRPO (see the `GRPOConfig` cell) and train. (Hugging Face)
- Run sanity inference, save the LoRA, re-load the LoRA for A/B comparison, then merge and push.
- Same RL family: We use GRPO to improve reasoning—the algorithm introduced by DeepSeek for math reasoning and carried into R1’s reasoning pipeline.
- Online, multi-sample training: Generate groups of outputs per prompt and compute relative advantages—a core GRPO idea.
- Reasoning-first objective: While our rewards are simpler (structure + correctness), the spirit matches R1’s focus on reinforcement-driven reasoning quality. (arXiv)
Important: We do not claim parity with DeepSeek’s scale, datasets, or exact multi-stage recipes; this is a lightweight, reproducible adaptation of the same RL principle on a 3B model.
- Exact-match on `<answer>` for GSM8K; optionally add stop sequences after `</answer>` (metric sketch after this list).
- Track: format adherence, accuracy, and length of `<reasoning>`.
- For deeper rigor, compare with a pure SFT baseline and an offline RFT/DPO variant.
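A minimal metric sketch along these lines, reusing the hypothetical `extract_xml_answer` helper from the earlier sketch:

```python
import re

def gsm8k_metrics(predictions, gold_answers):
    """Exact-match accuracy on <answer>, schema adherence, and mean <reasoning> length."""
    extracted = [extract_xml_answer(p) for p in predictions]
    accuracy = sum(e == g for e, g in zip(extracted, gold_answers)) / max(len(gold_answers), 1)

    schema = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
    adherence = sum(bool(schema.search(p)) for p in predictions) / max(len(predictions), 1)

    reasoning_lengths = [
        len(m.group(1).split()) if (m := re.search(r"<reasoning>(.*?)</reasoning>", p, re.DOTALL)) else 0
        for p in predictions
    ]
    return {
        "accuracy": accuracy,
        "format_adherence": adherence,
        "mean_reasoning_words": sum(reasoning_lengths) / max(len(reasoning_lengths), 1),
    }
```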
- Small scale: 3B with GSM8K is great for demos but not state-of-the-art.
- Rewards are brittle: Exact-match favors formatting; consider numeric tolerance and unit handling.
- Schema lock-in: Over-constraining output can hurt fluency; keep a balance.
- Safety: Do not deploy for high-stakes decisions without thorough red-teaming and guardrails.
- DeepSeek / GRPO / R1
  - DeepSeekMath: introduces GRPO (critic-free PPO variant) and shows reasoning gains.
  - DeepSeek-R1: RL-centric reasoning training; R1-Zero trained without SFT, R1 uses multi-stage training + RL. (arXiv)
  - Plain-language overviews of GRPO (HF course; community explainers). (Hugging Face, Oxen.ai)
- GRPO in TRL & vLLM
  - TRL GRPOTrainer docs; vLLM for fast online sampling in RL. (Hugging Face, VLLM Documentation)
- Models & Libraries
  - Qwen2.5-3B-Instruct model card / blog. (Hugging Face, Qwen)
  - Unsloth (repo/wiki; 2× inference notes). (GitHub)
  - PEFT docs and library. (Hugging Face, GitHub)
  - vLLM docs. (VLLM Documentation)
- Data / Quantization
  - GSM8K paper + dataset. (arXiv, Hugging Face)
  - LoRA paper; QLoRA paper/blog. (arXiv, Hugging Face)
- Thanks to the DeepSeek team for releasing work that popularized GRPO for reasoning.
- Thanks to Qwen, Hugging Face (Transformers, TRL, PEFT), Unsloth, vLLM maintainers, and the GSM8K authors for open resources. (Hugging Face, GitHub, VLLM Documentation, arXiv)