To set up the environment, run:

```bash
conda create -n open_reasoner python=3.10
conda activate open_reasoner
pip install -r requirements.txt
pip3 install "fschat[model_worker,webui]"
pip install -U pydantic
cd envs/MATH/latex2sympy
pip install -e .
cd -
```
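After installation, an optional sanity check (assuming the `open_reasoner` environment is active) can confirm that the key packages import cleanly:

```bash
# Optional sanity check: both packages should import without errors
python -c "import fastchat, pydantic; print(fastchat.__version__, pydantic.VERSION)"
```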
Before running the project, please ensure that all required base models are downloaded. The models used in this project include:

- Qwen2.5-72B-Instruct-GPTQ-Int4
- Qwen2.5-7B-Instruct-GPTQ-Int4
- Meta-Llama-3-70B-Instruct-GPTQ-Int4
- Llama-3-8B-Instruct-GPTQ-4-Bit
- peiyi9979/math-shepherd-mistral-7b-prm
- openreasoner/Math-psa
To download these models, please refer to the Hugging Face model downloading tutorial for step-by-step guidance on downloading models from the Hugging Face Hub. Please make sure that all models are saved in the directories expected by the project setup before proceeding.
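As one option, models can be fetched with the `huggingface-cli download` command from the `huggingface_hub` package; the repository ID below is taken from the list above, while the target directory is illustrative:

```bash
# Download one required model into a local directory (target path is illustrative)
huggingface-cli download peiyi9979/math-shepherd-mistral-7b-prm \
    --local-dir ./models/math-shepherd-mistral-7b-prm
```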
Before running inference, please modify the following variables in the scripts under `reason/llm_service/` to set the appropriate base models for your usage (an illustrative example follows the list):

- `$MODEL_BASE`: Set this to the directory where your models are stored.
- `$POLICY_MODEL_NAME`: Set this to the name of the large (policy) model you wish to use.
- `$VALUE_MODEL_NAME`: Set this to the name of the value model you wish to use.
- `$APPROXIMATION_MODEL_NAME`: Set this to the name of the approximation model you wish to use.
- `$NUM_LM_WORKER`: Set this to the number of language model (LM) workers to start.
- `$NUM_RM_WORKER`: Set this to the number of reward model (RM) workers to start.
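For illustration, a configured script might set these variables as follows (paths, model choices, and worker counts are placeholder values, not recommendations):

```bash
# Illustrative values only; adapt to your hardware and model locations
MODEL_BASE=/path/to/your/models
POLICY_MODEL_NAME=Qwen2.5-72B-Instruct-GPTQ-Int4        # large policy model
APPROXIMATION_MODEL_NAME=Qwen2.5-7B-Instruct-GPTQ-Int4  # small approximation model
VALUE_MODEL_NAME=math-shepherd-mistral-7b-prm           # reward/value model
NUM_LM_WORKER=1
NUM_RM_WORKER=1
```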
With these variables set, you can start the model services and run inference with the different techniques. For example, to start the LM, sLM, and RM services for the Math Shepherd model, run the following command:
```bash
sh reason/llm_service/create_service_math_shepherd.sh
```
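Once the services are up, you can optionally confirm that the workers registered with the controller (this assumes FastChat's default controller address, `localhost:21001`):

```bash
# Ask the FastChat controller which model workers are currently registered
curl -s -X POST http://localhost:21001/list_models
```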
To kill the server processes, we recommend using the following command:

```bash
tmux kill-session -t {Your Session Name}  # default session name is `FastChat`
```
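If you changed the session name, standard tmux commands can list what is running:

```bash
tmux ls  # list active tmux sessions and their names
```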
The main files are organized as follows:

- `./reason/evaluation/evaluate.py`: Main code for the overall inference pipeline.
- `./reason/evaluation/methods.py`: Implementation of the different search methods.
- `./envs/base_env.py`: Implementation details of SpecSearch.
All hyperparameters used for inference are set in the bash file `./scripts/eval/xxx.sh`, e.g. (an illustrative excerpt follows the list):
- `c`: Initial threshold.
- `baseline_mode`: 1-5 for baselines, 6 for the initial method using a fixed threshold, 7 for our SpecSearch, 8-10 for the ablation study.
- `model_mode`: 1 for Qwen models, 2 for Llama models.
- `N`: Number of thoughts per step.
- `M`: Beam size.
- `is_tensorboard`: 1 to enable TensorBoard logging, 0 to disable it.
- `full_reward`: 1 for full reward, 0 for partial reward.
- `x`: Value of theta.
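As a hypothetical excerpt (parameter names are taken from the list above; the values are illustrative placeholders, not tuned defaults), such a script might contain:

```bash
# Illustrative hyperparameter block; values are placeholders
c=0.8             # initial threshold
baseline_mode=7   # 7 selects SpecSearch
model_mode=1      # Qwen models
N=8               # thoughts per step
M=4               # beam size
is_tensorboard=0  # disable TensorBoard logging
full_reward=1     # use full reward
x=0.9             # value of theta
```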
Make sure that the model arguments (`--LM`, `--sLM`, `--RM`) in the evaluation script align with the variables (`$POLICY_MODEL_NAME`, `$APPROXIMATION_MODEL_NAME`, `$VALUE_MODEL_NAME`) used to launch the pending workers!
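As a sketch of what this alignment means (a hypothetical invocation; the actual argument wiring in the scripts may differ):

```bash
# The model names passed to the evaluation side must match the models
# the service workers were launched with
python reason/evaluation/evaluate.py \
    --LM "$POLICY_MODEL_NAME" \
    --sLM "$APPROXIMATION_MODEL_NAME" \
    --RM "$VALUE_MODEL_NAME"
```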
To run inference, export the project root to `PYTHONPATH` and launch one of the evaluation scripts:

```bash
export PYTHONPATH=$(pwd)

# Beam search
sh scripts/eval/beam_search_qwen.sh
sh scripts/eval/beam_search_llama.sh

# Vanilla MCTS
sh scripts/eval/vanila_mcts_qwen.sh
sh scripts/eval/vanila_mcts_llama.sh
```
This project builds upon the contributions of several open-source efforts:
- OpenR: for providing core functionalities for evaluation and baseline methods.
- LLaMA and Qwen: for the pretrained language models used in the inference process.
We thank the developers and maintainers of these projects for their excellent work.
If you find this project helpful for your research, please consider citing our work:
```bibtex
@misc{wang2025acceleratinglargelanguagemodel,
      title={Accelerating Large Language Model Reasoning via Speculative Search},
      author={Zhihai Wang and Jie Wang and Jilai Pan and Xilin Xia and Huiling Zhen and Mingxuan Yuan and Jianye Hao and Feng Wu},
      year={2025},
      eprint={2505.02865},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.02865},
}
```