GitHub - Turbo-AGI/MetaStone-S1: The open-source code of MetaStone-S1.

Introduction

We release our first reflective generative model: MetaStone-S1. With only 32B parameters, MetaStone-S1 performs comparably to the OpenAI-o3 series on mathematics, coding, and Chinese reasoning tasks.

MetaStone-S1 is trained with our proposed reflective generative form, which unifies “Long-CoT Reinforcement Learning” and “Process Reward Learning” in a single training scheme. This lets one model both reason deeply and select high-quality reasoning trajectories. Because the process reward model (PRM) shares its backbone network with the policy model, MetaStone-S1 cuts the PRM's inference cost by 99%, yielding faster and higher-quality responses.
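The shared-backbone idea can be illustrated with a toy sketch. This is not the MetaStone-S1 architecture: the class, its methods, and the arithmetic inside them are hypothetical stand-ins. The sketch only shows that the hidden features are computed once and then reused by both a policy head and a scoring head.

```python
import math


class SharedBackboneModel:
    """Toy stand-in: one backbone feeding a policy head and a scoring head."""

    def __init__(self):
        self.backbone_calls = 0

    def backbone(self, tokens):
        # Hypothetical feature extractor; a real model would be a transformer.
        self.backbone_calls += 1
        return [float(t) / (1 + i) for i, t in enumerate(tokens)]

    def policy_head(self, hidden):
        # Toy "next token": rounded sum of the features, folded into 0..9.
        return int(round(sum(hidden))) % 10

    def score_head(self, hidden):
        # Toy process score in (0, 1): sigmoid of the mean feature.
        mean = sum(hidden) / len(hidden)
        return 1.0 / (1.0 + math.exp(-mean))

    def step(self, tokens):
        hidden = self.backbone(tokens)  # computed once, reused by both heads
        return self.policy_head(hidden), self.score_head(hidden)


model = SharedBackboneModel()
next_token, score = model.step([3, 1, 4])
```

In this toy, scoring a step adds only the cost of `score_head`, because `backbone` runs a single time per call to `step`.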

This repo contains the training and evaluation code for MetaStone-S1. For full details, please refer to our paper and our official website.

Installation

conda create -n metastone-s1 python==3.10 
conda activate metastone-s1
pip install -e verl
pip install -r requirements.txt
pip install flash_attn==2.7.3

Model Zoo

Model	Transformers(HF)	ModelScope
MetaStone-S1-1.5B	MetaStone-S1-1.5B	MetaStone-S1-1.5B
MetaStone-S1-7B	MetaStone-S1-7B	MetaStone-S1-7B
MetaStone-S1-32B	MetaStone-S1-32B	MetaStone-S1-32B

Performance

Performance on small-size models

Model	AIME24	AIME25	LiveCodeBench	C-EVAL
DeepScaleR-1.5B-Preview	43.1	30.0	-	-
R1-Distill-Qwen-1.5B	28.9	22.8	16.9	27.1
R1-Distill-Qwen-7B	55.5	-	37.6	-
R1-Distill-Llama-8B	50.4	-	39.6	-
MetaStone-S1-1.5B-low	44.0	32.6	24.2	43.6
MetaStone-S1-1.5B-medium	53.1	35.7	26.6	43.9
MetaStone-S1-1.5B-high	57.9	40.4	28.1	44.1
MetaStone-S1-7B-low	60.7	45.4	41.7	55.1
MetaStone-S1-7B-medium	66.3	48.3	44.1	57.5
MetaStone-S1-7B-high	70.2	48.6	44.4	57.8

Performance on large-size models. Since the base model used in this repo is QwQ-32B, we compare against the contemporary DeepSeek R1-671B-0120 to keep the comparison fair.

Model	AIME24	AIME25	LiveCodeBench	C-EVAL
s1-32B	56.7	50.0	-	-
QwQ-32B	79.5	69.5	63.4	88.4
R1-Distill-Qwen-32B	72.6	49.6	57.2	82.2
GLM-Z1-32B-0414	80.8	63.6	59.1	-
DeepSeek-R1-671B	79.8	70.0	65.9	91.8
Claude-3.5-Sonnet1022	16.0	7.4	37.2	76.7
GPT-4o-0513	9.3	11.6	32.9	-
OpenAI-o1-mini	63.6	50.7	53.8	68.9
OpenAI-o1-1217	79.2	-	63.4	-
OpenAI-o3-mini-medium	79.6	74.8	67.4	75.9
MetaStone-S1-32B-low	82.0	72.0	63.8	89.5
MetaStone-S1-32B-medium	84.2	73.4	64.0	89.6
MetaStone-S1-32B-high	85.2	73.6	64.2	89.7

Train

Single-Node Training

export WANDB_API_KEY=YOUR_WANDB_API_KEY
bash ./scripts/run_single_node.sh

Multi-Node Training

# start ray
bash ./verl/examples/ray/run_worker_n.sh
# start training
bash ./scripts/run_multi_node.sh

Convert the checkpoint to Hugging Face format

python convert_ckpt.py --root path/to/model --step n --world_size 8

Evaluation

We currently release a basic test pipeline for mathematical benchmarks.

Step 1: Run the API of the reward model

CUDA_VISIBLE_DEVICES=0 python test/score_model_queue.py --model_path path/to/huggingface/model --score_model_dim 1536 --lang 'en' --ip '0.0.0.0' --port '8001'

Step 2: Run the API of the policy model

export VLLM_ATTENTION_BACKEND=XFORMERS
CUDA_VISIBLE_DEVICES=0 python test/policy_model_queue.py --model_path path/to/huggingface/model --ip '0.0.0.0' --port '8000'

We recommend starting multiple policy model APIs for fast evaluation.
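Launching several policy servers amounts to repeating the Step 2 command with a different GPU and port each time. The helper below is a minimal sketch that only builds the command strings. The function name and the one-GPU-per-server layout are assumptions, and the example ports skip 8001 because Step 1 uses it for the reward model.

```python
def policy_server_commands(model_path, gpu_ports, ip="0.0.0.0"):
    """Build one Step 2 launch command per (gpu_id, port) pair.

    Assumes one GPU per policy server; the script path and flags are
    copied from the Step 2 command.
    """
    commands = []
    for gpu, port in gpu_ports:
        commands.append(
            f"CUDA_VISIBLE_DEVICES={gpu} python test/policy_model_queue.py "
            f"--model_path {model_path} --ip '{ip}' --port '{port}'"
        )
    return commands


# GPU 0 and port 8001 are left free for the reward model from Step 1.
commands = policy_server_commands(
    "path/to/huggingface/model", [(1, 8000), (2, 8002)]
)
for command in commands:
    print(command)
```

Run each printed command in its own shell after exporting `VLLM_ATTENTION_BACKEND=XFORMERS`, as in Step 2.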

Step 3: Inference on the target benchmark

python test/inference.py --task 'aime24' --input_file data/aime24.jsonl --output_file path/to/result --n_samples 16 --model_dir path/to/huggingface/model --score_api_url "http://ip:port/score" --response_api_url "http://ip1:port1/score,http://ip2:port2/score" --branch 2

We recommend setting --branch to 2× the number of policy model APIs.
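The branch recommendation can be derived from the list of policy endpoints instead of being typed by hand. The sketch below assembles the Step 3 arguments. The function name is hypothetical, the flags are the ones shown in the command above, and the URLs are the same placeholders.

```python
def build_inference_argv(task, input_file, output_file, model_dir,
                         score_api_url, response_api_urls, n_samples=16):
    """Assemble the Step 3 command as an argv list.

    --branch follows the recommendation of 2x the number of policy APIs.
    """
    branch = 2 * len(response_api_urls)
    return [
        "python", "test/inference.py",
        "--task", task,
        "--input_file", input_file,
        "--output_file", output_file,
        "--n_samples", str(n_samples),
        "--model_dir", model_dir,
        "--score_api_url", score_api_url,
        # Multiple policy endpoints are passed as one comma-separated string.
        "--response_api_url", ",".join(response_api_urls),
        "--branch", str(branch),
    ]


argv = build_inference_argv(
    task="aime24",
    input_file="data/aime24.jsonl",
    output_file="path/to/result",
    model_dir="path/to/huggingface/model",
    score_api_url="http://ip:port/score",
    response_api_urls=["http://ip1:port1/score", "http://ip2:port2/score"],
)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the placeholders are replaced with real addresses.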

Step 4: Compute the pass@1 metric

python test/compute_metric.py --task 'aime24' --result_paths path/to/result --N 2

Set N to 2, 8, or 32 for low, medium, or high mode, respectively.
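To make the role of N concrete, here is a minimal sketch of best-of-N selection followed by a pass@1 computation. It is not the logic of `test/compute_metric.py`: the record layout (`answer`, `score`, `target`), the rule of picking the single highest-scoring candidate, and the use of only the first N candidates are all assumptions for illustration.

```python
MODE_TO_N = {"low": 2, "medium": 8, "high": 32}


def select_best(candidates, n):
    """Return the highest-scoring candidate among the first n."""
    return max(candidates[:n], key=lambda c: c["score"])


def pass_at_1(problems, mode):
    """Fraction of problems whose selected answer matches the target."""
    n = MODE_TO_N[mode]
    hits = [
        select_best(p["candidates"], n)["answer"] == p["target"]
        for p in problems
    ]
    return sum(hits) / len(hits)


# Hypothetical records: each problem carries scored candidate answers.
problems = [
    {"target": "42", "candidates": [
        {"answer": "41", "score": 0.3}, {"answer": "42", "score": 0.9}]},
    {"target": "7", "candidates": [
        {"answer": "7", "score": 0.4}, {"answer": "8", "score": 0.6}]},
]
result = pass_at_1(problems, "low")
print(result)
```

In the second toy problem the scorer prefers a wrong answer, so selection by score does not guarantee a correct final answer.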

Citation

If you find our work helpful, please consider citing it.

@misc{wang2025testtimescalingreflectivegenerative,
 title={Test-Time Scaling with Reflective Generative Model}, 
 author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie},
 year={2025},
 eprint={2507.01951},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2507.01951}, 
}
