
Fast Large Language Model Collaborative Decoding via Speculation


Collaborative decoding via Speculation (CoS) is a novel framework that accelerates collaborative decoding of multiple LLMs, e.g. weighted ensembling or contrastive decoding, without sacrificing performance. It achieves a 1.11x to 2.23x speedup over standard collaboration methods in two- and three-model configurations.

News

  • [2025/5/29] 🚀 Our paper has been renamed from *Speculative Ensemble: Fast Large Language Model Ensemble via Speculation* to *Fast Large Language Model Collaborative Decoding via Speculation*.

  • [2025/5/29] We release the NPU version of CoS on the npu branch. It is implemented purely with transformers and PyTorch, making it easier to read and understand.

  • [2025/5/1] ✨ Our paper has been accepted at ICML 2025.

  • [2025/2/1] We release our paper on arXiv.

Setup

# Create and activate the environment
conda create -n cos python=3.11 -y
conda activate cos

# Install vllm
cd vllm
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://files.pythonhosted.org/packages/c8/f4/e108a902ccad131d8978a9376343a6e95d78d0e12f152a796794647073ec/vllm-0.6.5-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .

# Install the remaining dependencies
cd ..
pip install -r requirements.txt
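
After installation, an optional quick check confirms the editable vLLM build is importable:

# Optional: verify the editable vllm install resolves
python -c "import vllm; print(vllm.__version__)"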

How to Run

  1. Create a .env file to specify the root path of your models.

    MODEL_PATH=xxx
    
  2. Run an example:

    CUDA_VISIBLE_DEVICES=0 python ./main_dataset.py \
        dataset.name=humaneval \
        dataset.size=tiny \
        method=sd \
        method.model="llama-2-7b" \
        method.draft_model="llama-2-7b-68m" \
        method.gamma=5 \
        method.generate.temperature=0

Code Reading Guides

Chef is the internal name for the CoS implementation. The code is located in vllm/vllm/chef, while the baseline ensemble implementation can be found at vllm/vllm/ensemble_decode.

We have implemented multiple model inference methods. The configuration files are located in configs/method, and the desired method can be selected via method={method_name} (see Step 2 of How to Run). The available methods are as follows:

| Method | Description | Args Note |
| --- | --- | --- |
| large_model | Inference using a single model in an autoregressive manner | - |
| cd | Contrastive decoding with two models | Requires method.amateur_model and method.alpha |
| we | Weighted ensemble with two models | Requires method.extra_model and method.lambda |
| *_sd | Accelerates the ensemble directly using speculative decoding | Inherits from a specific ensemble method, with method.gamma as an additional hyperparameter |
| *_chef | Accelerates the ensemble using speculative ensemble (CoS) | Inherits from a specific ensemble method. For ensembles with more than two models, method.gamma should be a list of integers |
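
For example, following the naming above, accelerating a weighted ensemble with CoS would look something like this. The we_chef config name and the extra model below are illustrative assumptions; check configs/method for the configs actually provided.

# Illustrative sketch: assumes a we_chef config following the *_chef naming
CUDA_VISIBLE_DEVICES=0 python ./main_dataset.py \
    dataset.name=humaneval \
    dataset.size=tiny \
    method=we_chef \
    method.model="llama-2-7b" \
    method.extra_model="llama-2-13b" \
    method.lambda=0.5 \
    method.gamma=5 \
    method.generate.temperature=0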

Customization

Customizing an Ensemble Method

  1. Create a YAML file, such as configs/method/your_ens_method.yaml, referencing configs/method/we.yaml (a sketch follows this list).
  2. Customize the method.extra_model parameter (as a string or a list of strings) and any additional parameters if needed.
  3. Modify the llm.ensemble_fn and llm.ensemble_target to define the ensemble function.
  4. Finally, use method=your_ens_method to run your custom ensemble method.
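
For illustration, such a config might look like the sketch below. The field names here are assumptions inferred from the steps above, not the repository's actual schema; treat configs/method/we.yaml as the authoritative template.

# configs/method/your_ens_method.yaml -- hypothetical sketch only;
# copy the real field names from configs/method/we.yaml
extra_model: "llama-2-13b"   # step 2: a string or a list of strings
lambda: 0.5                  # example additional parameter, as in we.yaml
# step 3: point llm.ensemble_fn / llm.ensemble_target at your ensemble
# function (their exact location in the config tree is assumed here)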

Customizing a Dataset

  1. Create a directory src/mydatasets/your_dataset/ and a file src/mydatasets/your_dataset/mydataset.py containing a class that inherits DatasetBase (see the sketch after this list).
  2. Use dataset=your_dataset to run your custom dataset.
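
As a rough sketch only: the DatasetBase interface is not documented here, so every method name below is hypothetical; mirror an existing dataset under src/mydatasets for the real contract.

# src/mydatasets/your_dataset/mydataset.py -- hypothetical sketch
# NOTE: the overridden method names are assumptions; copy them from an
# existing dataset under src/mydatasets/ rather than from this example.
from src.mydatasets import DatasetBase  # import path assumed

class MyDataset(DatasetBase):
    def load(self):
        # read raw samples, e.g. from disk or the HuggingFace hub
        ...

    def build_prompt(self, sample):
        # convert one raw sample into a model prompt string
        ...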

Citation

@inproceedings{fu2025speculative,
  title={Fast Large Language Model Collaborative Decoding via Speculation},
  author={Fu, Jiale and Jiang, Yuchu and Chen, Junkai and Fan, Jiaming and Geng, Xin and Yang, Xu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}

Acknowledgements

We built our implementation upon the vLLM project and would like to thank the authors for their outstanding contributions.
