This is the source code of our ICML 2025 paper, "Accelerating Large Language Model Reasoning via Speculative Search".

Accelerating Large Language Model Reasoning via Speculative Search

Paper · Code

[ English ][ 中文 ]

Table of Contents 📖
  1. Getting Started
  2. Usage
  3. Acknowledgements
  4. Citation

Getting Started

Installation

# Create and activate the conda environment
conda create -n open_reasoner python=3.10
conda activate open_reasoner

# Install project dependencies
pip install -r requirements.txt
pip install "fschat[model_worker,webui]"
pip install -U pydantic

# Install the local latex2sympy package used by the MATH environment
cd envs/MATH/latex2sympy
pip install -e .
cd -
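
As a quick, optional sanity check (a sketch; the fschat package installs under the import name fastchat), you can confirm the key dependencies import cleanly:

python -c "import fastchat, pydantic; print('fastchat', fastchat.__version__)"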

Download Base Models

Before running the project, please ensure that all required base models are downloaded. The models used in this project include:

  • Qwen2.5-72B-Instruct-GPTQ-Int4, Qwen2.5-7B-Instruct-GPTQ-Int4
  • Meta-Llama-3-70B-Instruct-GPTQ-Int4, Llama-3-8B-Instruct-GPTQ-4-Bit
  • peiyi9979/math-shepherd-mistral-7b-prm
  • openreasoner/Math-psa

To download these models, follow the Hugging Face model downloading tutorial, which provides step-by-step guidance on fetching models from the Hugging Face Hub.

Before proceeding, make sure each model is saved in the directory expected by the project setup.
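
For example, one common route is the huggingface-cli tool from the huggingface_hub package (a sketch; the --local-dir paths below are placeholders and should match the $MODEL_BASE directory configured in the Quickstart):

pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --local-dir /path/to/models/Qwen2.5-7B-Instruct-GPTQ-Int4
huggingface-cli download peiyi9979/math-shepherd-mistral-7b-prm --local-dir /path/to/models/math-shepherd-mistral-7b-prm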

Quickstart

Before running inference, modify the following variables in the scripts under reason/llm_service/ to set the appropriate base models for your usage (an example configuration follows the list):

  • $MODEL_BASE: Set this to the directory where your models are stored.
  • $POLICY_MODEL_NAME: Set this to the name of the policy (target) model you wish to use.
  • $VALUE_MODEL_NAME: Set this to the name of the value model you wish to use.
  • $APPROXIMATION_MODEL_NAME: Set this to the name of the approximation (draft) model you wish to use.
  • $NUM_LM_WORKER: Set this to the number of language model (LM) workers to start.
  • $NUM_RM_WORKER: Set this to the number of reward model (RM) workers to start.
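
For instance, a Qwen-based configuration might look like the following (all paths and model names here are illustrative placeholders, not the scripts' actual defaults):

MODEL_BASE=/path/to/models
POLICY_MODEL_NAME=Qwen2.5-72B-Instruct-GPTQ-Int4
APPROXIMATION_MODEL_NAME=Qwen2.5-7B-Instruct-GPTQ-Int4
VALUE_MODEL_NAME=math-shepherd-mistral-7b-prm
NUM_LM_WORKER=1
NUM_RM_WORKER=1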

With these variables set, you can start the services and run inference using the different techniques below.

Start LM & RM Services

For example, to start the LM, sLM (the small approximation model), and RM services for the Math Shepherd model, run the following command:

sh reason/llm_service/create_service_math_shepherd.sh

To kill the server processes, we recommend using the following command:

tmux kill-session -t {Your Session Name} # default is `FastChat`
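
If you are unsure of the session name, list the active tmux sessions first:

tmux ls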

Usage

The main files are organized as follows:

  • ./reason/evaluation/evaluate.py: Main entry point for the overall inference pipeline.
  • ./reason/evaluation/methods.py: Implementation of different search methods.
  • ./envs/base_env.py: Implementation details of SpecSearch.

All hyperparameters used for inference are set in the bash file ./scripts/eval/xxx.sh (an illustrative snippet follows the list below), e.g.,

  • c : Initial threshold.
  • baseline_mode : 1-5 for baselines, 6 for initial method using a fixed threshold, 7 for our SpecSearch, 8-10 for ablation study.
  • model_mode : 1 for Qwen models, 2 for Llama models.
  • N : Number of thoughts per step.
  • M : Beam size.
  • is_tensorboard : 1 for tensorboard, 0 for no tensorboard.
  • full_reward : 1 for full reward, 0 for partial reward.
  • x : Value of theta.
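
As an illustration, the relevant portion of such a script might look like the following (variable names follow the list above; the values are arbitrary placeholders, not recommended settings):

c=0.5             # initial threshold
baseline_mode=7   # 7 = our SpecSearch
model_mode=1      # 1 = Qwen models
N=8               # number of thoughts per step
M=4               # beam size
is_tensorboard=0  # disable tensorboard logging
full_reward=1     # use full reward
x=0.9             # value of theta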

Run Inference

⚠️ Make sure the script inputs (--LM, --sLM, --RM) match the variables ($POLICY_MODEL_NAME, $APPROXIMATION_MODEL_NAME, $VALUE_MODEL_NAME) configured for the running workers!

export PYTHONPATH=$(pwd)

sh scripts/eval/beam_search_qwen.sh

sh scripts/eval/beam_search_llama.sh

sh scripts/eval/vanila_mcts_qwen.sh

sh scripts/eval/vanila_mcts_llama.sh

Acknowledgements

This project builds upon the contributions of several open-source efforts:

  • OpenR: for providing core functionalities for evaluation and baseline methods.
  • LLaMA and Qwen: for the pretrained language models used in the inference process.

We thank the developers and maintainers of these projects for their excellent work.

Citation

If you find this project helpful for your research, please consider citing our work:

@misc{wang2025acceleratinglargelanguagemodel,
      title={Accelerating Large Language Model Reasoning via Speculative Search}, 
      author={Zhihai Wang and Jie Wang and Jilai Pan and Xilin Xia and Huiling Zhen and Mingxuan Yuan and Jianye Hao and Feng Wu},
      year={2025},
      eprint={2505.02865},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.02865}, 
}
