
Fast Large Language Model Collaborative Decoding via Speculation


Collaborative decoding via Speculation (CoS) is a novel framework that accelerates collaborative decoding of multiple LLMs, e.g. weighted ensembling or contrastive decoding, without sacrificing performance. It achieves a 1.11x to 2.23x speedup over standard collaboration methods in two- and three-model configurations.

News

  • [2025/5/29] 🚀 Our paper has been renamed from *Speculative Ensemble: Fast Large Language Model Ensemble via Speculation* to *Fast Large Language Model Collaborative Decoding via Speculation*.

  • [2025/5/29] We release the NPU version of CoS on the npu branch. It is implemented purely with transformers and PyTorch, making it easier to read and understand.

  • [2025/5/1] ✨ Our paper has been accepted at ICML 2025.

  • [2025/2/1] We release our paper on arXiv.

Setup

# Create and activate the environment
conda create -n cos python=3.11 -y
conda activate cos

# Install vllm
cd vllm
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://files.pythonhosted.org/packages/c8/f4/e108a902ccad131d8978a9376343a6e95d78d0e12f152a796794647073ec/vllm-0.6.5-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .

# Install the remaining dependencies
cd ..
pip install -r requirements.txt
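
After installation, an optional quick check confirms the editable vLLM build is importable:

# Optional: verify the editable vllm install resolves
python -c "import vllm; print(vllm.__version__)"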

How to Run

  1. Create a .env file to specify the root path of your models.

    MODEL_PATH=xxx
    
  2. Run an example:

    CUDA_VISIBLE_DEVICES=0 python ./main_dataset.py \
        dataset.name=humaneval \
        dataset.size=tiny \
        method=sd \
        method.model="llama-2-7b" \
        method.draft_model="llama-2-7b-68m" \
        method.gamma=5 \
        method.generate.temperature=0

Code Reading Guides

Chef is the internal name for the CoS implementation. The code is located in vllm/vllm/chef, while the baseline ensemble implementation can be found at vllm/vllm/ensemble_decode.

We have implemented multiple model inference methods. The configuration files are located in configs/method, and the desired method can be selected via method={method_name} (see Step 2 of How to Run). The available methods are as follows:

| Method | Description | Args Note |
| --- | --- | --- |
| large_model | Inference using a single model in an autoregressive manner | - |
| cd | Contrastive decoding with two models | Requires method.amateur_model and method.alpha |
| we | Weighted ensemble with two models | Requires method.extra_model and method.lambda |
| *_sd | Accelerates the ensemble directly using speculative decoding | Inherits from a specific ensemble method, with method.gamma as an additional hyperparameter |
| *_chef | Accelerates the ensemble using speculative ensemble (CoS) | Inherits from a specific ensemble method. For ensembles with more than two models, method.gamma should be a list of integers |
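
For example, following the naming above, accelerating a weighted ensemble with CoS would look something like this. The we_chef config name and the extra model below are illustrative assumptions; check configs/method for the configs actually provided.

# Illustrative sketch: assumes a we_chef config following the *_chef naming
CUDA_VISIBLE_DEVICES=0 python ./main_dataset.py \
    dataset.name=humaneval \
    dataset.size=tiny \
    method=we_chef \
    method.model="llama-2-7b" \
    method.extra_model="llama-2-13b" \
    method.lambda=0.5 \
    method.gamma=5 \
    method.generate.temperature=0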

Customization

Customizing an Ensemble Method

  1. Create a YAML file, such as configs/method/your_ens_method.yaml, referencing configs/method/we.yaml (a sketch follows this list).
  2. Customize the method.extra_model parameter (as a string or a list of strings) and any additional parameters if needed.
  3. Modify the llm.ensemble_fn and llm.ensemble_target to define the ensemble function.
  4. Finally, use method=your_ens_method to run your custom ensemble method.
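
For illustration, such a config might look like the sketch below. The field names here are assumptions inferred from the steps above, not the repository's actual schema; treat configs/method/we.yaml as the authoritative template.

# configs/method/your_ens_method.yaml -- hypothetical sketch only;
# copy the real field names from configs/method/we.yaml
extra_model: "llama-2-13b"   # step 2: a string or a list of strings
lambda: 0.5                  # example additional parameter, as in we.yaml
# step 3: point llm.ensemble_fn / llm.ensemble_target at your ensemble
# function (their exact location in the config tree is assumed here)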

Customizing a Dataset

  1. Create a directory src/mydatasets/your_dataset/ and a file src/mydatasets/your_dataset/mydataset.py containing a class that inherits DatasetBase (see the sketch after this list).
  2. Use dataset=your_dataset to run your custom dataset.
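
As a rough sketch only: the DatasetBase interface is not documented here, so every method name below is hypothetical; mirror an existing dataset under src/mydatasets for the real contract.

# src/mydatasets/your_dataset/mydataset.py -- hypothetical sketch
# NOTE: the overridden method names are assumptions; copy them from an
# existing dataset under src/mydatasets/ rather than from this example.
from src.mydatasets import DatasetBase  # import path assumed

class MyDataset(DatasetBase):
    def load(self):
        # read raw samples, e.g. from disk or the HuggingFace hub
        ...

    def build_prompt(self, sample):
        # convert one raw sample into a model prompt string
        ...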

Citation

@inproceedings{fu2025speculative,
  title={Fast Large Language Model Collaborative Decoding via Speculation},
  author={Fu, Jiale and Jiang, Yuchu and Chen, Junkai and Fan, Jiaming and Geng, Xin and Yang, Xu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}

Acknowledgements

We built our implementation upon the vLLM project and would like to thank the authors for their outstanding contributions.
