Turbo-AGI/MetaStone-S1

 
 

Introduction

We release our first reflective generative model: MetaStone-S1. With only 32B parameters, MetaStone-S1 performs comparably to the OpenAI-o3 series on mathematics, coding, and Chinese reasoning tasks.

[Figure: Performance compared with OpenAI-o3-mini]

MetaStone-S1 is trained with our proposed reflective generative form, which combines “Long-CoT Reinforcement Learning” and “Process Reward Learning” into a unified training form. This form enables a single model to perform both deep reasoning and high-quality selection of reasoning trajectories. By sharing the backbone network between the process reward model (PRM) and the policy model, MetaStone-S1 reduces the inference cost of the PRM by 99%, resulting in faster and higher-quality responses.
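The trajectory selection described above can be pictured as a best-of-N step: several candidate reasoning trajectories are scored step by step, and the highest-scoring one is returned. The following is a minimal sketch under stated assumptions, not the repository's implementation. The function names, the geometric-mean aggregation of step scores, and the toy scores are all hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step process scores into a single trajectory score.

    Assumption: a geometric mean is used here for illustration only.
    """
    if not step_scores:
        raise ValueError("a trajectory needs at least one step score")
    eps = 1e-12  # guards against log(0)
    # Sum logs instead of multiplying so long trajectories do not underflow.
    log_sum = sum(math.log(max(score, eps)) for score in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(candidates: Sequence[str],
                scores: Sequence[Sequence[float]]) -> str:
    """Return the candidate whose step scores aggregate highest."""
    if len(candidates) != len(scores):
        raise ValueError("need exactly one score list per candidate")
    best_index = max(range(len(candidates)),
                     key=lambda i: trajectory_score(scores[i]))
    return candidates[best_index]


if __name__ == "__main__":
    # Toy data: the first trajectory contains one weak step.
    answers = ["answer A", "answer B"]
    step_scores = [[0.9, 0.2, 0.8], [0.7, 0.7, 0.7]]
    print(select_best(answers, step_scores))
```

With a geometric mean, a single low-scoring step pulls the whole trajectory down, so the consistently scored second candidate is preferred in the toy data.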

This repo contains the training and evaluation code of MetaStone-S1. For full details, please refer to our paper and our official website.

Installation

conda create -n metastone-s1 python==3.10 
conda activate metastone-s1
pip install -e verl
pip install -r requirements.txt
pip install flash_attn==2.7.3
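After the commands above, a quick check can confirm that the environment matches the pinned versions. This is an optional sketch, not part of the repository. The helper names `installed_version` and `python_matches` are hypothetical, and the check relies only on the standard library.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_matches(major: int = 3, minor: int = 10) -> bool:
    """Check whether the running interpreter is the expected major.minor."""
    return sys.version_info[:2] == (major, minor)


if __name__ == "__main__":
    # The README pins Python 3.10 and flash_attn 2.7.3.
    print("Python 3.10:", python_matches())
    print("flash_attn:", installed_version("flash_attn"))
```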

Model Zoo

| Model | Transformers (HF) | ModelScope |
|---|---|---|
| MetaStone-S1-1.5B | MetaStone-S1-1.5B | MetaStone-S1-1.5B |
| MetaStone-S1-7B | MetaStone-S1-7B | MetaStone-S1-7B |
| MetaStone-S1-32B | MetaStone-S1-32B | MetaStone-S1-32B |

Performance

Performance on small-size models

| Model | AIME24 | AIME25 | LiveCodeBench | C-EVAL |
|---|---|---|---|---|
| DeepScaleR-1.5B-Preview | 43.1 | 30.0 | - | - |
| R1-Distill-Qwen-1.5B | 28.9 | 22.8 | 16.9 | 27.1 |
| R1-Distill-Qwen-7B | 55.5 | - | 37.6 | - |
| R1-Distill-Llama-8B | 50.4 | - | 39.6 | - |
| MetaStone-S1-1.5B-low | 44.0 | 32.6 | 24.2 | 43.6 |
| MetaStone-S1-1.5B-medium | 53.1 | 35.7 | 26.6 | 43.9 |
| MetaStone-S1-1.5B-high | 57.9 | 40.4 | 28.1 | 44.1 |
| MetaStone-S1-7B-low | 60.7 | 45.4 | 41.7 | 55.1 |
| MetaStone-S1-7B-medium | 66.3 | 48.3 | 44.1 | 57.5 |
| MetaStone-S1-7B-high | 70.2 | 48.6 | 44.4 | 57.8 |

Performance on large-size Models. Since the base model used for this repo is QwQ-32B, we chose the contemporary DeepSeek R1-671B-0120 to ensure a fair comparison.

| Model | AIME24 | AIME25 | LiveCodeBench | C-EVAL |
|---|---|---|---|---|
| s1-32B | 56.7 | 50.0 | - | - |
| QwQ-32B | 79.5 | 69.5 | 63.4 | 88.4 |
| R1-Distill-Qwen-32B | 72.6 | 49.6 | 57.2 | 82.2 |
| GLM-Z1-32B-0414 | 80.8 | 63.6 | 59.1 | - |
| DeepSeek-R1-671B | 79.8 | 70.0 | 65.9 | 91.8 |
| Claude-3.5-Sonnet1022 | 16.0 | 7.4 | 37.2 | 76.7 |
| GPT-4o-0513 | 9.3 | 11.6 | 32.9 | - |
| OpenAI-o1-mini | 63.6 | 50.7 | 53.8 | 68.9 |
| OpenAI-o1-1217 | 79.2 | - | 63.4 | - |
| OpenAI-o3-mini-medium | 79.6 | 74.8 | 67.4 | 75.9 |
| MetaStone-S1-32B-low | 82.0 | 72.0 | 63.8 | 89.5 |
| MetaStone-S1-32B-medium | 84.2 | 73.4 | 64.0 | 89.6 |
| MetaStone-S1-32B-high | 85.2 | 73.6 | 64.2 | 89.7 |

Train

Single-Node Training

export WANDB_API_KEY=YOUR_WANDB_API_KEY
bash ./scripts/run_single_node.sh

Multi-Node Training

# start ray
bash ./verl/examples/ray/run_worker_n.sh
# start training
bash ./scripts/run_multi_node.sh

Convert the checkpoint to Hugging Face format

python convert_ckpt.py --root path/to/model --step n --world_size 8

Evaluation

We currently release a basic test pipeline for mathematical benchmarks.

Step 1: Run the API of the reward model

CUDA_VISIBLE_DEVICES=0 python test/score_model_queue.py --model_path path/to/huggingface/model --score_model_dim 1536 --lang 'en' --ip '0.0.0.0' --port '8001'

Step 2: Run the API of the policy model

export VLLM_ATTENTION_BACKEND=XFORMERS
CUDA_VISIBLE_DEVICES=0 python test/policy_model_queue.py --model_path path/to/huggingface/model --ip '0.0.0.0' --port '8000'

We recommend starting multiple policy model APIs for fast evaluation.
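One way to start several policy model APIs is to generate one launch command per GPU, each on its own port. The sketch below only builds the command strings and reuses the flags shown in Step 2. The helper name `policy_launch_commands`, the GPU ids, and the port-assignment scheme are assumptions for illustration. Port 8001 is skipped by default because the reward model API in Step 1 uses it.

```python
import shlex
from typing import List, Sequence


def policy_launch_commands(model_path: str,
                           gpu_ids: Sequence[int],
                           base_port: int = 8000,
                           reserved_ports: Sequence[int] = (8001,)) -> List[str]:
    """Build one policy-API launch command per GPU, each on a distinct port."""
    commands = []
    port = base_port
    for gpu in gpu_ids:
        # Skip ports already taken, e.g. by the reward model API.
        while port in reserved_ports:
            port += 1
        commands.append(
            f"CUDA_VISIBLE_DEVICES={gpu} python test/policy_model_queue.py "
            f"--model_path {shlex.quote(model_path)} "
            f"--ip '0.0.0.0' --port '{port}'"
        )
        port += 1
    return commands


if __name__ == "__main__":
    # Hypothetical setup: GPUs 1 and 2 serve the policy model.
    for command in policy_launch_commands("path/to/huggingface/model", [1, 2]):
        print(command)
```

Each printed command can be run in its own shell after exporting `VLLM_ATTENTION_BACKEND=XFORMERS` as shown in Step 2.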

Step 3: Inference on the target benchmark

python test/inference.py --task 'aime24' --input_file data/aime24.jsonl --output_file path/to/result --n_samples 16 --model_dir path/to/huggingface/model --score_api_url "http://ip:port/score" --response_api_url "http://ip1:port1/score,http://ip2:port2/score" --branch 2

We recommend setting --branch to twice the number of policy model APIs.
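The recommended branch value can be derived from the comma-separated list passed to `--response_api_url`. The helper names below are hypothetical, and the sketch only encodes the 2× rule of thumb stated above.

```python
from typing import List


def split_api_urls(response_api_url: str) -> List[str]:
    """Split the comma-separated URL list, ignoring empty entries."""
    return [url.strip() for url in response_api_url.split(",") if url.strip()]


def recommended_branch(response_api_url: str) -> int:
    """Twice the number of policy model APIs, per the recommendation."""
    return 2 * len(split_api_urls(response_api_url))


if __name__ == "__main__":
    urls = "http://ip1:port1/score,http://ip2:port2/score"
    print(recommended_branch(urls))
```

For the two placeholder endpoints above, the helper returns 4.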

Step 4: Compute the pass@1 metric

python test/compute_metric.py --task 'aime24' --result_paths path/to/result --N 2

Set N to 2, 8, or 32 for the low, medium, or high mode, respectively.
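The mode-to-N mapping can be kept in one place so the metric command is always built with a consistent value. This is a small sketch: the names `MODE_TO_N` and `metric_command` are hypothetical, and only the flags shown in Step 4 are used.

```python
from typing import Dict, List

# Mapping from the reported mode names to the N value of compute_metric.py.
MODE_TO_N: Dict[str, int] = {"low": 2, "medium": 8, "high": 32}


def metric_command(task: str, result_path: str, mode: str) -> List[str]:
    """Build the argument list for test/compute_metric.py for a given mode."""
    if mode not in MODE_TO_N:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_TO_N)}")
    return [
        "python", "test/compute_metric.py",
        "--task", task,
        "--result_paths", result_path,
        "--N", str(MODE_TO_N[mode]),
    ]


if __name__ == "__main__":
    print(" ".join(metric_command("aime24", "path/to/result", "medium")))
```

The returned list can be passed directly to `subprocess.run` if the evaluation is driven from Python.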

Citation

If you find our work helpful, please consider citing it.

@misc{wang2025testtimescalingreflectivegenerative,
 title={Test-Time Scaling with Reflective Generative Model}, 
 author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie},
 year={2025},
 eprint={2507.01951},
 archivePrefix={arXiv},
 primaryClass={cs.LG},
 url={https://arxiv.org/abs/2507.01951}, 
}
