
Enable auto-detection for Eagle speculators format models #3


Closed

Conversation


@rahul-tuli rahul-tuli commented Jul 16, 2025

Summary

This PR introduces auto-detection for Eagle models in "speculators" format, dramatically simplifying the UX for
speculative decoding. Users can now simply run vllm serve <speculators-model> and vLLM will automatically configure
everything needed for speculative decoding.

Before (current behavior)

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 5, "draft_tensor_parallel_size": 1}'

After (with this PR)

vllm serve nm-testing/eagle-llama3.1-8b-instruct

Motivation

Eagle models in "speculators" format contain all the necessary information for speculative decoding in their
configuration:

  • Target model name
  • Speculative decoding method (eagle/eagle3)
  • Number of speculative tokens
  • Draft model architecture

Previously, users had to manually extract this information and construct a complex JSON configuration. This PR automates
that process.
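
For illustration, the relevant fields might look roughly like the sketch below, expressed as a Python dict. Only the config paths named in this PR are shown, and the field carrying the token count is a placeholder, not copied from a real checkpoint:

```python
# Illustrative sketch only -- not a dump of a real checkpoint.
# Shows the config paths this PR reads; the exact field holding the
# speculative-token count is an assumption here.
example_speculators_config = {
    "speculators_model_type": "eagle3",  # speculative decoding method
    "speculators_config": {
        "target_config": {
            "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        },
        # alternate checkpoints may carry the target here instead:
        # "verifier": {"name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
    },
    "num_speculative_tokens": 5,  # placeholder field name
}
```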

Implementation Approach

  1. Detection Function

Added extract_speculators_info() in vllm/transformers_utils/configs/speculators_eagle.py that:

  • Checks if a model is in speculators format by looking for speculators_model_type in config
  • Extracts target model from either speculators_config.target_config.model_name or
    speculators_config.verifier.name_or_path
  • Returns method, target model, and number of speculative tokens
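
A minimal sketch of this logic, assuming the draft model's config has already been loaded as a plain dict (the real implementation in speculators_eagle.py handles config loading, error cases, and the exact field name for the token count):

```python
# Sketch of the detection logic described above; not the actual vLLM code.
from typing import Optional


def extract_speculators_info_sketch(config: dict) -> Optional[dict]:
    """Return method, target model, and token count, or None if not speculators format."""
    method = config.get("speculators_model_type")  # "eagle" or "eagle3"
    if method is None:
        return None  # not a speculators-format model

    spec_cfg = config.get("speculators_config", {})
    target = (
        spec_cfg.get("target_config", {}).get("model_name")
        or spec_cfg.get("verifier", {}).get("name_or_path")
    )
    if target is None:
        return None

    return {
        "method": method,
        "target_model": target,
        # the real config field for the token count may differ
        "num_speculative_tokens": config.get("num_speculative_tokens", 5),
    }
```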
  2. Auto-Detection in Engine Args

Modified EngineArgs.from_cli_args() to:

  • Check if the provided model is speculators format when no --speculative-config is provided
  • Automatically build the speculative config from extracted metadata
  • Swap the model argument to point to the target model
  • Update the tokenizer to use the target model (fixing tokenizer loading issues)
  • Log the auto-detection process for transparency
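
Roughly, the wiring looks like the sketch below. Here load_model_config() is a hypothetical stand-in for however the draft model's config is fetched, and the real EngineArgs code differs in structure and attribute names:

```python
# Sketch of the auto-detection step; builds on extract_speculators_info_sketch()
# above. load_model_config() is a hypothetical helper, not a vLLM API.
import json
import logging

logger = logging.getLogger(__name__)


def maybe_apply_speculators_autodetect(args) -> None:
    """Mutate parsed CLI args in place when a speculators-format model is given."""
    if args.speculative_config or args.no_auto_speculative:
        return  # explicit config wins; --no-auto-speculative opts out

    info = extract_speculators_info_sketch(load_model_config(args.model))
    if info is None:
        return  # regular model, nothing to do

    draft_model = args.model
    args.model = info["target_model"]       # serve the target model
    args.tokenizer = info["target_model"]   # tokenizer also comes from the target
    args.speculative_config = json.dumps({
        "method": info["method"],
        "model": draft_model,
        "num_speculative_tokens": info["num_speculative_tokens"],
    })
    logger.info("Auto-detected Eagle speculators format model: draft=%s target=%s",
                draft_model, args.model)
```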
  3. New CLI Arguments

Added two new arguments for fine-grained control:

  • --draft-tensor-parallel-size: Set tensor parallel size for the draft model
  • --no-auto-speculative: Disable auto-detection when needed
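
For reference, the two flags behave like the plain argparse sketch below; the actual definitions live in vLLM's EngineArgs CLI plumbing and differ in detail:

```python
# Standalone argparse sketch of the two new flags; not the actual vLLM wiring.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--draft-tensor-parallel-size", type=int, default=None,
    help="Tensor parallel size for the draft model",
)
parser.add_argument(
    "--no-auto-speculative", action="store_true",
    help="Disable auto-detection of speculators-format models",
)
```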

Examples

Basic Usage

# Eagle-1 model
vllm serve nm-testing/eagle-llama3.1-8b-instruct

# Eagle-3 model
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators

# HASS variant
vllm serve nm-testing/hass-llama3.1-8b-layernorms

With Custom Settings

# Set draft model tensor parallel size
vllm serve nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators --draft-tensor-parallel-size 1

# Use with other vLLM options
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.8

Opt-Out of Auto-Detection

# Disable auto-detection and load as a regular model
vllm serve some-speculators-model --no-auto-speculative

Backward Compatibility

# Explicit configuration still works

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 5}'

What Gets Auto-Configured

When a speculators format model is detected, the following happens automatically:

  1. Target model is extracted and set as the main model
  2. Draft model is set to the provided speculators model path
  3. Method is set based on speculators_model_type (eagle/eagle3)
  4. Number of speculative tokens is extracted from config
  5. Tokenizer is updated to use the target model

The auto-detection logs show what's happening:

INFO ... 🦅 Auto-detected Eagle speculators format model
INFO ...   Target model: meta-llama/Meta-Llama-3.1-8B-Instruct
INFO ...   Draft model: nm-testing/eagle3-llama3.1-8b-instruct-speculators
INFO ...   Method: eagle3
INFO ...   Speculative tokens: 5
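
Put differently, for the eagle3 example in the log lines above, the auto-detection is roughly equivalent to passing this configuration by hand; the keys mirror the manual --speculative-config JSON, and this is an illustration rather than a dump of the generated object:

```python
# Equivalent hand-written speculative config for the example above.
auto_built_speculative_config = {
    "method": "eagle3",
    "model": "nm-testing/eagle3-llama3.1-8b-instruct-speculators",  # draft model
    "num_speculative_tokens": 5,
}
# The served model is swapped to meta-llama/Meta-Llama-3.1-8B-Instruct,
# and the tokenizer is switched to that target model as well.
```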

Testing

Tested with various Eagle models:

  • ✅ Eagle-1 models
  • ✅ Eagle-3 models
  • ✅ HASS variants

Related Issues

This PR builds on top of vllm-project#20436 (Eagle Qwen support) and should be merged after it.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the remaining CI tests by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces auto-detection for Eagle models, simplifying the UX for speculative decoding. The changes to EngineArgs and the addition of the extract_speculators_info function are well-implemented. To improve debuggability, I've suggested adding logging to the extract_speculators_info function to surface issues with malformed model configurations.

rahul-tuli and others added 2 commits July 17, 2025 14:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@rahul-tuli
Owner Author

merged!

@rahul-tuli rahul-tuli closed this Jul 17, 2025