# Enable auto-detection for Eagle speculators format models #3
## Summary

This PR introduces auto-detection for Eagle models in "speculators" format, dramatically simplifying the UX for speculative decoding. Users can now simply run

```bash
vllm serve <speculators-model>
```

and vLLM will automatically configure everything needed for speculative decoding.
### Before (current behavior)
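The exact command is not preserved in this description; as an illustrative sketch, launching speculative decoding previously required manually building a JSON config (model names and values here are hypothetical, though `--speculative-config` is the real flag):

```bash
# Manually extract the draft model info and pass an explicit JSON config
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"model": "nm-testing/eagle-llama3.1-8b-instruct", "method": "eagle", "num_speculative_tokens": 5}'
```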
### After (with this PR)
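With this PR, the same setup reduces to a single command (model name taken from the examples below):

```bash
vllm serve nm-testing/eagle-llama3.1-8b-instruct
```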
## Motivation
Eagle models in "speculators" format contain all the necessary information for speculative decoding in their
configuration:
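The original snippet is not preserved here; based on the fields this PR reads, a representative excerpt of such a model's `config.json` (values hypothetical) looks like:

```json
{
  "speculators_model_type": "eagle3",
  "speculators_config": {
    "target_config": {
      "model_name": "meta-llama/Llama-3.1-8B-Instruct"
    },
    "verifier": {
      "name_or_path": "meta-llama/Llama-3.1-8B-Instruct"
    }
  }
}
```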
Previously, users had to manually extract this information and construct a complex JSON configuration. This PR automates
that process.
## Implementation Approach

- Added `extract_speculators_info()` in `vllm/transformers_utils/configs/speculators_eagle.py` (sketched below) that:
  - Detects `speculators_model_type` in the config
  - Resolves the target model from `speculators_config.target_config.model_name` or `speculators_config.verifier.name_or_path`
- Modified `EngineArgs.from_cli_args()` to auto-detect speculators-format models when no `--speculative-config` is provided
- Added two new arguments for fine-grained control:
  - `--draft-tensor-parallel-size`: Set the tensor parallel size for the draft model
  - `--no-auto-speculative`: Disable auto-detection when needed
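A minimal sketch of the detection helper, assuming the config layout shown in the Motivation section (the actual implementation may differ):

```python
# Hypothetical sketch of extract_speculators_info(); the real helper lives in
# vllm/transformers_utils/configs/speculators_eagle.py and may differ in detail.
from typing import Optional, Tuple


def extract_speculators_info(config: dict) -> Optional[Tuple[str, str]]:
    """Return (method, target_model) for speculators-format configs, else None."""
    # Speculators-format configs carry the method name directly.
    method = config.get("speculators_model_type")  # e.g. "eagle" or "eagle3"
    if method is None:
        return None
    spec = config.get("speculators_config", {})
    # The target (verifier) model can live in either of two places.
    target = (spec.get("target_config", {}).get("model_name")
              or spec.get("verifier", {}).get("name_or_path"))
    if target is None:
        return None
    return method, target
```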
## Examples

### Basic Usage

```bash
# Eagle-1 model
vllm serve nm-testing/eagle-llama3.1-8b-instruct

# Eagle-3 model
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators

# HASS variant
vllm serve nm-testing/hass-llama3.1-8b-layernorms
```
### With Custom Settings

```bash
# Set draft model tensor parallel size
vllm serve nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators --draft-tensor-parallel-size 1

# Use with other vLLM options (this command is illustrative; the original was
# not preserved, but --port and --max-model-len are standard vLLM flags)
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators --port 8000 --max-model-len 4096
```
### Opt-Out of Auto-Detection

```bash
# Disable auto-detection and load as a regular model
vllm serve some-speculators-model --no-auto-speculative
```
## Backward Compatibility

Auto-detection only runs when no `--speculative-config` is provided, so existing explicit configurations keep working unchanged; `--no-auto-speculative` opts out entirely.
## What Gets Auto-Configured
When a speculators format model is detected, the following happens automatically:
- The speculative decoding method is set from `speculators_model_type` (eagle/eagle3)
- The target model is resolved from the speculators config

The auto-detection path also logs what it configures at startup.
## Testing

Tested with various Eagle models, including the Eagle-1, Eagle-3, and HASS variants shown in the examples above.
## Related Issues
This PR builds on top of vllm-project#20436 (Eagle Qwen support) and should be merged after it.