πͺ‘ We introduce a novel framework within GRPO to utilize model-based verifier to support wider range of answer types. We train on WebInstruct-verfied, a dataset covering wide range of reasoning topics beyond math. We demonstrate substantial improvements over traditional binary rule-based rewards across diverse domains beyond math.
β
Model-based verifier to support verification of diverse answer types like math expression, string, list, fraction, matrix, etc;
β
Small 7B/14B models achieve robust cross-domain rewards; It boosts MMLU-Pro performance by 13%.
β
Our method does not require any additional SFT.
Check out our Arxiv Paper for the details!
Our experimental results are as follows:
Our discpline distribution is described as follows:
Our model-based verifier helps scaling the verifiable reasoning questions:
Data | Size | Link |
---|---|---|
WebInstruct-verified | 230k | π€ |
Model | Backbone | Link |
---|---|---|
General-Verifier | Qwen/Qwen2.5-Math-1.5B | π€ |
Check out HF page to learn how to use it. Feel free to plug this into your current RL-training framework.
Model | Backbone | Link |
---|---|---|
General-Reasoner-Qwen2.5-7B | Qwen2.5-7B-Base | π€ |
General-Reasoner-Qwen2.5-14B | Qwen2.5-14B-Base | π€ |
General-Reasoner-Qwen3-4B | Qwen3-4B-Base | π€ |
General-Reasoner-Qwen3-14B | Qwen3-14B-Base | π€ |
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install flash-attn --no-build-isolation
pip install -e ./verl
pip install vllm==0.8.3
pip install flashinfer-python
pip install math-verify
python data_preprocess.py --local-dir <data_dir>/webinstruct-verified
huggingface-cli download TIGER-Lab/general-verifier --local-dir <data_dir>/general-reasoner-verifier
huggingface-cli download Qwen/Qwen2.5-7B --local-dir <data_dir>/Qwen2.5-7B
Edit the environment variables in train_general_reasoner.sh
to fit your system setup.
ray start --address <MASTER-NODE-IP>:6379
bash train_general_reasoner.sh
python -m evaluation.eval_mmlupro \
--model_path TIGER-Lab/General-Reasoner-14B \
--output_file output-mmlupro-General-Reasoner-14B.json
python -m evaluation.eval_supergpqa \
--model_path TIGER-Lab/General-Reasoner-14B \
--output_file output-supergpqa-General-Reasoner-14B.json
We evaluate math and GPQA tasks using the simple-eval
framework.
For non-multiple choice questions, answer equivalence is verified using GPT-4o
.
export OPENAI_API_KEY=<replace by OPENAI API KEY>
vllm serve TIGER-Lab/General-Reasoner-14B --tensor-parallel-size 4
python -m evaluation.simple-evals.run_simple_evals_qwen \
--model General-Reasoner-14B
By default, the model uses greedy decoding. For AIME24 and AIME25, scores are averaged over 32 runs with temperature 1. For more configuration details, refer to
evaluation/simple-evals/run_simple_evals_qwen.py
.
Our 7B and 14B models are trained from the corresponding base qwen models.
Model Name | MMLU-Pro | GPQA | SuperGPQA | TheoremQA | BBEH |
---|---|---|---|---|---|
Qwen2.5-7B-Base | 47.7 | 25.8 | 26.7 | 29.1 | 8.0 |
Qwen2.5-7B-Instruct | 57.0 | 33.8 | 30.7 | 36.6 | 12.2 |
SimpleRL-Qwen2.5-7B-Zoo | 51.5 | 24.2 | 29.9 | 38.0 | 11.9 |
General-Reasoner-7B | 58.9 | 34.3 | 34.2 | 45.3 | 12.5 |
Qwen2.5-14B-Base | 53.3 | 32.8 | 30.7 | 33.0 | 10.8 |
Qwen2.5-14B-Instruct | 62.7 | 41.4 | 35.8 | 41.9 | 15.2 |
SimpleRL-Qwen2.5-14B-Zoo | 64.0 | 39.4 | 35.7 | 40.8 | 13.6 |
General-Reasoner-14B | 66.6 | 43.4 | 39.5 | 44.3 | 15.2 |
Qwen3-4B-Base | 51.6 | 26.3 | 25.4 | 34.8 | 8.1 |
Qwen3-4B-Instruct | 61.8 | 41.7 | 32.1 | 42.0 | 14.9 |
General-Reasoner-Qw3-4B | 62.8 | 42.9 | 32.5 | 48.3 | 12.2 |
Qwen3-14B-Base | 64.2 | 45.9 | 36.5 | 44.0 | 13.0 |
Qwen3-14B-Instruct | 70.9 | 54.8 | 39.8 | 42.4 | 19.2 |
General-Reasoner-Qw3-14B | 70.3 | 56.1 | 39.9 | 54.4 | 17.3 |
Model Name | MATH-500 | Olympiad | Minerva | GSM8K | AMC | AIME24x32 | AIME25x32 |
---|---|---|---|---|---|---|---|
Qwen2.5-7B-Base | 60.2 | 28.6 | 36.0 | 83.1 | 30.0 | 3.8 | 1.4 |
Qwen2.5-7B-Instruct | 75.0 | 39.4 | 45.2 | 90.9 | 52.5 | 12.5 | 8.5 |
SimpleRL-Qwen2.5-7B-Zoo | 74.0 | 41.9 | 49.6 | 90.7 | 60.0 | 15.2 | 7.5 |
General-Reasoner-7B | 76.0 | 37.9 | 54.0 | 92.7 | 55.0 | 13.8 | 10.4 |
Qwen2.5-14B-Base | 65.4 | 33.5 | 24.3 | 91.6 | 37.5 | 3.6 | 2.9 |
Qwen2.5-14B-Instruct | 77.4 | 44.7 | 52.2 | 94.5 | 57.5 | 12.2 | 11.0 |
SimpleRL-Qwen2.5-14B-Zoo | 77.2 | 44.6 | 54.0 | 94.2 | 60.0 | 12.9 | 11.8 |
General-Reasoner-14B | 78.6 | 42.1 | 58.1 | 94.2 | 70.0 | 17.5 | 16.9 |
Qwen3-4B-Base | 68.2 | 34.8 | 42.3 | 72.6 | 47.5 | 10.3 | 6.7 |
Qwen3-4B-Instruct(non-think) | 80.4 | 49.0 | 57.0 | 92.0 | 62.5 | 22.5 | 16.1 |
General-Reasoner-Qw3-4B | 80.6 | 47.7 | 57.7 | 92.2 | 60.0 | 20.0 | 15.4 |
Qwen3-14B-Base | 74.6 | 44.3 | 55.9 | 93.2 | 55.0 | 14.7 | 11.4 |
Qwen3-14B-Instruct(non-think) | 82.0 | 52.4 | 59.9 | 93.9 | 57.5 | 28.5 | 25.1 |
General-Reasoner-Qw3-14B | 83.8 | 51.9 | 68.0 | 94.4 | 70.0 | 24.4 | 19.2 |
This project is built upon the following open-source projects:
@article{general-reasoner,
title={General-Reasoner: Advancing LLM Reasoning Across All Domains},
author={Xueguang Ma and Qian Liu and Dongfu Jiang and Ge Zhang and Zejun Ma and Wenhu Chen},
year={2025},
journal={arXiv:2505.14652},
url={https://arxiv.org/abs/2505.14652},
}