We release our first reflective generative model: MetaStone-S1.
With only 32B parameters, MetaStone-S1 performs comparably to the OpenAI o3-mini series on mathematics, coding, and Chinese reasoning tasks.
MetaStone‑S1 is trained with our proposed reflective generative form, which unifies “Long-CoT Reinforcement Learning” and “Process Reward Learning” in a single training procedure. This lets one model both reason deeply and select high-quality reasoning trajectories. Because the process reward model (PRM) shares its backbone network with the policy model, MetaStone‑S1 cuts the inference cost of the PRM by 99%, yielding faster and higher-quality responses.
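To make the trajectory-selection idea concrete, here is a minimal sketch of choosing one reasoning trajectory from several sampled candidates by their process scores. Everything in it is illustrative: the `Trajectory` container, the `select_best` helper, and the geometric-mean aggregation are assumptions made for this example, not the repo's actual API. In MetaStone-S1 the per-step scores would come from the scoring head that shares the policy backbone.

```python
import math
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One sampled reasoning trajectory (hypothetical container)."""
    answer: str
    step_scores: list[float]  # per-step process scores, each assumed in (0, 1]


def trajectory_score(step_scores: list[float]) -> float:
    """Aggregate step scores with a geometric mean (an assumed choice)."""
    if not step_scores:
        return 0.0
    # Work in log space and clamp tiny values to avoid log(0).
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(candidates: list[Trajectory]) -> Trajectory:
    """Return the candidate whose aggregated process score is highest."""
    return max(candidates, key=lambda t: trajectory_score(t.step_scores))


candidates = [
    Trajectory("204", [0.9, 0.8, 0.95]),
    Trajectory("210", [0.9, 0.2, 0.99]),  # one weak step drags the score down
]
best = select_best(candidates)
print(best.answer)
```

A geometric mean penalizes a single low-scoring step more than an arithmetic mean would, which is why the second candidate loses here despite two strong steps.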
This repo contains the training and evaluation code for MetaStone-S1. For full details, please refer to our paper and our official website.
conda create -n metastone-s1 python==3.10
conda activate metastone-s1
pip install -e verl
pip install -r requirements.txt
pip install flash_attn==2.7.3
Model | Transformers (HF) | ModelScope |
---|---|---|
MetaStone-S1-1.5B | MetaStone-S1-1.5B | MetaStone-S1-1.5B |
MetaStone-S1-7B | MetaStone-S1-7B | MetaStone-S1-7B |
MetaStone-S1-32B | MetaStone-S1-32B | MetaStone-S1-32B |
Performance on small-size models
Model | AIME24 | AIME25 | LiveCodeBench | C-EVAL |
---|---|---|---|---|
DeepScaleR-1.5B-Preview | 43.1 | 30.0 | - | - |
R1-Distill-Qwen-1.5B | 28.9 | 22.8 | 16.9 | 27.1 |
R1-Distill-Qwen-7B | 55.5 | - | 37.6 | - |
R1-Distill-Llama-8B | 50.4 | - | 39.6 | - |
MetaStone-S1-1.5B-low | 44.0 | 32.6 | 24.2 | 43.6 |
MetaStone-S1-1.5B-medium | 53.1 | 35.7 | 26.6 | 43.9 |
MetaStone-S1-1.5B-high | 57.9 | 40.4 | 28.1 | 44.1 |
MetaStone-S1-7B-low | 60.7 | 45.4 | 41.7 | 55.1 |
MetaStone-S1-7B-medium | 66.3 | 48.3 | 44.1 | 57.5 |
MetaStone-S1-7B-high | 70.2 | 48.6 | 44.4 | 57.8 |
Performance on large-size models. Because the base model used in this repo is QwQ-32B, we compare against the contemporaneous DeepSeek-R1-671B-0120 release to keep the comparison fair.
Model | AIME24 | AIME25 | LiveCodeBench | C-EVAL |
---|---|---|---|---|
s1-32B | 56.7 | 50.0 | - | - |
QwQ-32B | 79.5 | 69.5 | 63.4 | 88.4 |
R1-Distill-Qwen-32B | 72.6 | 49.6 | 57.2 | 82.2 |
GLM-Z1-32B-0414 | 80.8 | 63.6 | 59.1 | - |
DeepSeek-R1-671B | 79.8 | 70.0 | 65.9 | 91.8 |
Claude-3.5-Sonnet-1022 | 16.0 | 7.4 | 37.2 | 76.7 |
GPT-4o-0513 | 9.3 | 11.6 | 32.9 | - |
OpenAI-o1-mini | 63.6 | 50.7 | 53.8 | 68.9 |
OpenAI-o1-1217 | 79.2 | - | 63.4 | - |
OpenAI-o3-mini-medium | 79.6 | 74.8 | 67.4 | 75.9 |
MetaStone-S1-32B-low | 82.0 | 72.0 | 63.8 | 89.5 |
MetaStone-S1-32B-medium | 84.2 | 73.4 | 64.0 | 89.6 |
MetaStone-S1-32B-high | 85.2 | 73.6 | 64.2 | 89.7 |
export WANDB_API_KEY=YOUR_WANDB_API_KEY
bash ./scripts/run_single_node.sh
# start ray
bash ./verl/examples/ray/run_worker_n.sh
# start training
bash ./scripts/run_multi_node.sh
python convert_ckpt.py --root path/to/model --step n --world_size 8
We currently release a simple test pipeline for mathematical benchmarks.
CUDA_VISIBLE_DEVICES=0 python test/score_model_queue.py --model_path path/to/huggingface/model --score_model_dim 1536 --lang 'en' --ip '0.0.0.0' --port '8001'
export VLLM_ATTENTION_BACKEND=XFORMERS
CUDA_VISIBLE_DEVICES=0 python test/policy_model_queue.py --model_path path/to/huggingface/model --ip '0.0.0.0' --port '8000'
We recommend starting multiple policy model APIs for fast evaluation.
python test/inference.py --task 'aime24' --input_file data/aime24.jsonl --output_file path/to/result --n_samples 16 --model_dir path/to/huggingface/model --score_api_url "http://ip:port/score" --response_api_url "http://ip1:port1/score,http://ip2:port2/score" --branch 2
We recommend setting --branch to 2× the number of policy model APIs.
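As a small illustration of this rule of thumb, the sketch below derives the recommended branch value from the comma-separated endpoint list passed to --response_api_url. The helper names `parse_api_urls` and `recommended_branch` are hypothetical and not part of this repo.

```python
def parse_api_urls(response_api_url: str) -> list[str]:
    """Split the comma-separated --response_api_url value into clean URLs."""
    return [u.strip() for u in response_api_url.split(",") if u.strip()]


def recommended_branch(response_api_url: str) -> int:
    """Apply the rule of thumb: branch = 2 x number of policy model APIs."""
    return 2 * len(parse_api_urls(response_api_url))


urls = "http://ip1:port1/score,http://ip2:port2/score"
print(recommended_branch(urls))  # two endpoints -> 4
```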
python test/compute_metric.py --task 'aime24' --result_paths path/to/result --N 2
Set --N to 2, 8, or 32 for the low, medium, or high mode, respectively.
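For scripting evaluations across modes, the mapping can be written down once and used to build the metric command. This is a minimal sketch: `MODE_TO_N` and `metric_command` are hypothetical names introduced here, and only the flags shown in the command above are used.

```python
import shlex

# Mode-to-N mapping as stated above.
MODE_TO_N = {"low": 2, "medium": 8, "high": 32}


def metric_command(task: str, result_path: str, mode: str) -> list[str]:
    """Build the compute_metric.py argument list for a given mode."""
    if mode not in MODE_TO_N:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_TO_N)}")
    return [
        "python", "test/compute_metric.py",
        "--task", task,
        "--result_paths", result_path,
        "--N", str(MODE_TO_N[mode]),
    ]


cmd = metric_command("aime24", "path/to/result", "high")
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` from the repo root.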
If you find our work helpful, please consider citing our paper.
@misc{wang2025testtimescalingreflectivegenerative,
title={Test-Time Scaling with Reflective Generative Model},
author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie},
year={2025},
eprint={2507.01951},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.01951},
}