PARD is a high-performance speculative decoding method that also enables low-cost adaptation of autoregressive draft models into parallel draft models. It offers the following advantages:
- Low-Cost Training: PARD adapts AR (autoregressive) draft models into parallel draft models with minimal overhead. Compared to pure AR draft models, PARD achieves an average inference speedup of 1.78×. By introducing a conditional drop-token strategy, PARD improves training efficiency by up to 3× while maintaining the same level of accuracy.
- Generalizability: Thanks to its target-independent design, a single PARD draft model can accelerate an entire family of target models. This contrasts with target-dependent approaches such as Medusa and EAGLE, which require retraining or tuning for each new target. As a result, PARD significantly reduces both deployment complexity and adaptation cost.
- High Performance: When integrated into an optimized inference framework called Transformers+, PARD delivers up to a 4.08× speedup, with LLaMA3.1 8B reaching a state-of-the-art 311.5 tokens per second. When integrated into vLLM, PARD delivers up to a 3.06× speedup, outperforming other speculative decoding methods in vLLM by 1.51×.
- 2025.07.16: Support Qwen3.
- 2025.06.30: Support vLLM.
# rocm
rocm/pytorch:rocm6.3.2_ubuntu22.04_py3.10_pytorch_release_2.5.1_preview
# cuda
nvcr.io/nvidia/pytorch:25.02-py3
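If you want to run inside one of these containers, the sketch below shows typical launch commands; the workspace mount, shared-memory size, and device flags are assumptions, so adjust them for your machine:

```bash
# ROCm: expose the standard ROCm GPU devices to the container
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --ipc=host --shm-size 16G \
  -v "$PWD":/workspace -w /workspace \
  rocm/pytorch:rocm6.3.2_ubuntu22.04_py3.10_pytorch_release_2.5.1_preview

# CUDA: --gpus requires the NVIDIA Container Toolkit
docker run -it --rm --gpus all \
  --ipc=host --shm-size 16G \
  -v "$PWD":/workspace -w /workspace \
  nvcr.io/nvidia/pytorch:25.02-py3
```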
git clone https://github.com/AMD-AIG-AIMA/PARD
cd PARD
pip3 install -r requirement.txt --no-build-isolation
| Model Series | Model Name | Download |
|---|---|---|
| llama3 | PARD-Llama-3.2-1B | 🤗 HuggingFace |
| DSR Qwen | PARD-DeepSeek-R1-Distill-Qwen-1.5B | 🤗 HuggingFace |
| Qwen | PARD-Qwen2.5-0.5B | 🤗 HuggingFace |
| Qwen3 | PARD-Qwen3-0.6B | 🤗 HuggingFace |
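The checkpoints can also be fetched ahead of time with `huggingface-cli`. The repository id below is an assumption; use the exact path from the HuggingFace link in the table above:

```bash
# Hypothetical repo id; replace with the actual path from the table above
huggingface-cli download amd/PARD-Llama-3.2-1B --local-dir ./models/PARD-Llama-3.2-1B
```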
python3 -m pard.infer -c config/eval/llama3_eval.yaml
python3 -m pard.infer -c config/eval/dsrq_eval.yaml
python3 -m pard.infer -c config/eval/qwen_eval.yaml
- `-k`, `--draft_k` (default: 12) Specifies the number of draft tokens generated in each speculative decoding iteration. Setting this to 0 disables speculative decoding and runs the baseline method instead.
- `--tokens` (default: 512) Sets the maximum number of tokens generated during inference.
- `-d`, `--draft` (default: `'qwen_0.5b_pard'`) The name or path of the draft model.
- `-t`, `--target` (default: `'qwen_2.5_7b'`) The name or path of the target model.
- `-b`, `--benchmark` (default: `'humaneval'`) Specifies the benchmark dataset to use for evaluation. Choices include `humaneval`, `gsm8k`, and `math500`.
- `-ms`, `--model_serie` (default: None) Model series of the target model. Choices include `llama3`, `qwen`, `r1`, and `None`. When set to None, the series is automatically inferred from the target model's name.
- `--para` (flag; default: False) Enables the parallel draft model mode. When not set, an autoregressive (AR) draft model is used instead.
- `--nc` (flag; default: False) Disables torch compile.
- `--maxtune` (flag; default: False) Enables maxtune for the target model.
- `--max_cache_len` (default: None) Sets the maximum cache length for the model. If not provided, it defaults to the value of `--tokens`.
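Putting the options together, a typical invocation looks like the sketch below. It assumes flags passed on the command line are combined with the values in the YAML config; check the config files above for the exact precedence.

```bash
# Speculative decoding on GSM8K with a parallel draft model and 12 draft tokens per step
python3 -m pard.infer -c config/eval/qwen_eval.yaml \
    -d qwen_0.5b_pard -t qwen_2.5_7b \
    -b gsm8k -k 12 --tokens 512 --para
```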
Setup
git clone -b model/integrate-pard-0521 https://github.com/zihaoanllm/vllm.git
cd vllm
# Set up using Python-only build
# Other installation methods can be found in the official vLLM documentation.
VLLM_USE_PRECOMPILED=1 pip install --editable .
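A quick import check verifies that the editable build from the PARD branch is the one being picked up:

```bash
python3 -c "import vllm; print(vllm.__version__, vllm.__file__)"
```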
Inference
python3 -m utils.vllm_infer
python3 -m pard.train -c config/train/example_qwen.yaml
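To train a draft model for a different target family, a reasonable starting point is to copy the example config and adjust it; the field names inside the YAML are defined by the repo, so inspect `example_qwen.yaml` before editing. The copied filename below is just a placeholder.

```bash
cp config/train/example_qwen.yaml config/train/my_model.yaml
# edit my_model.yaml for your draft/target models and data, then:
python3 -m pard.train -c config/train/my_model.yaml
```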
@article{an2025pard,
title={PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation},
author={An, Zihao and Bai, Huajun and Liu, Ziqiong and Li, Dong and Barsoum, Emad},
journal={arXiv preprint arXiv:2504.18583},
year={2025}
}