
Enable auto-detection for Eagle speculators format models #3


Closed

Conversation


@rahul-tuli rahul-tuli commented Jul 16, 2025

Summary

This PR introduces auto-detection for Eagle models in "speculators" format, dramatically simplifying the UX for
speculative decoding. Users can now simply run vllm serve <speculators-model> and vLLM will automatically configure
everything needed for speculative decoding.

Before (current behavior)

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 5, "draft_tensor_parallel_size": 1}'

After (with this PR)

vllm serve nm-testing/eagle-llama3.1-8b-instruct

Motivation

Eagle models in "speculators" format contain all the necessary information for speculative decoding in their
configuration:

  • Target model name
  • Speculative decoding method (eagle/eagle3)
  • Number of speculative tokens
  • Draft model architecture

Previously, users had to manually extract this information and construct a complex JSON configuration. This PR automates
that process.
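
For illustration, the relevant fields might look roughly like the sketch below, expressed as a Python dict. Only the config paths named in this PR are shown, and the field carrying the token count is a placeholder, not copied from a real checkpoint:

```python
# Illustrative sketch only -- not a dump of a real checkpoint.
# Shows the config paths this PR reads; the exact field holding the
# speculative-token count is an assumption here.
example_speculators_config = {
    "speculators_model_type": "eagle3",  # speculative decoding method
    "speculators_config": {
        "target_config": {
            "model_name": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        },
        # alternate checkpoints may carry the target here instead:
        # "verifier": {"name_or_path": "meta-llama/Meta-Llama-3.1-8B-Instruct"},
    },
    "num_speculative_tokens": 5,  # placeholder field name
}
```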

Implementation Approach

  1. Detection Function

Added extract_speculators_info() in vllm/transformers_utils/configs/speculators_eagle.py that:

  • Checks if a model is in speculators format by looking for speculators_model_type in config
  • Extracts target model from either speculators_config.target_config.model_name or
    speculators_config.verifier.name_or_path
  • Returns method, target model, and number of speculative tokens
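
A minimal sketch of this logic, assuming the draft model's config has already been loaded as a plain dict (the real implementation in speculators_eagle.py handles config loading, error cases, and the exact field name for the token count):

```python
# Sketch of the detection logic described above; not the actual vLLM code.
from typing import Optional


def extract_speculators_info_sketch(config: dict) -> Optional[dict]:
    """Return method, target model, and token count, or None if not speculators format."""
    method = config.get("speculators_model_type")  # "eagle" or "eagle3"
    if method is None:
        return None  # not a speculators-format model

    spec_cfg = config.get("speculators_config", {})
    target = (
        spec_cfg.get("target_config", {}).get("model_name")
        or spec_cfg.get("verifier", {}).get("name_or_path")
    )
    if target is None:
        return None

    return {
        "method": method,
        "target_model": target,
        # the real config field for the token count may differ
        "num_speculative_tokens": config.get("num_speculative_tokens", 5),
    }
```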
  2. Auto-Detection in Engine Args

Modified EngineArgs.from_cli_args() to:

  • Check if the provided model is speculators format when no --speculative-config is provided
  • Automatically build the speculative config from extracted metadata
  • Swap the model argument to point to the target model
  • Update the tokenizer to use the target model (fixing tokenizer loading issues)
  • Log the auto-detection process for transparency
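
Roughly, the wiring looks like the sketch below. Here load_model_config() is a hypothetical stand-in for however the draft model's config is fetched, and the real EngineArgs code differs in structure and attribute names:

```python
# Sketch of the auto-detection step; builds on extract_speculators_info_sketch()
# above. load_model_config() is a hypothetical helper, not a vLLM API.
import json
import logging

logger = logging.getLogger(__name__)


def maybe_apply_speculators_autodetect(args) -> None:
    """Mutate parsed CLI args in place when a speculators-format model is given."""
    if args.speculative_config or args.no_auto_speculative:
        return  # explicit config wins; --no-auto-speculative opts out

    info = extract_speculators_info_sketch(load_model_config(args.model))
    if info is None:
        return  # regular model, nothing to do

    draft_model = args.model
    args.model = info["target_model"]       # serve the target model
    args.tokenizer = info["target_model"]   # tokenizer also comes from the target
    args.speculative_config = json.dumps({
        "method": info["method"],
        "model": draft_model,
        "num_speculative_tokens": info["num_speculative_tokens"],
    })
    logger.info("Auto-detected Eagle speculators format model: draft=%s target=%s",
                draft_model, args.model)
```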
  3. New CLI Arguments

Added two new arguments for fine-grained control:

  • --draft-tensor-parallel-size: Set tensor parallel size for the draft model
  • --no-auto-speculative: Disable auto-detection when needed
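
For reference, the two flags behave like the plain argparse sketch below; the actual definitions live in vLLM's EngineArgs CLI plumbing and differ in detail:

```python
# Standalone argparse sketch of the two new flags; not the actual vLLM wiring.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--draft-tensor-parallel-size", type=int, default=None,
    help="Tensor parallel size for the draft model",
)
parser.add_argument(
    "--no-auto-speculative", action="store_true",
    help="Disable auto-detection of speculators-format models",
)
```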

Examples

Basic Usage

# Eagle-1 model
vllm serve nm-testing/eagle-llama3.1-8b-instruct

# Eagle-3 model
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators

# HASS variant
vllm serve nm-testing/hass-llama3.1-8b-layernorms

With Custom Settings

# Set draft model tensor parallel size
vllm serve nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators --draft-tensor-parallel-size 1

# Use with other vLLM options
vllm serve nm-testing/eagle3-llama3.1-8b-instruct-speculators \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.8

Opt-Out of Auto-Detection

# Disable auto-detection and load as a regular model
vllm serve some-speculators-model --no-auto-speculative

Backward Compatibility

# Explicit configuration still works

vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B", "num_speculative_tokens": 5}'

What Gets Auto-Configured

When a speculators format model is detected, the following happens automatically:

  1. Target model is extracted and set as the main model
  2. Draft model is set to the provided speculators model path
  3. Method is set based on speculators_model_type (eagle/eagle3)
  4. Number of speculative tokens is extracted from config
  5. Tokenizer is updated to use the target model

The auto-detection logs show what's happening:

INFO ... 🦅 Auto-detected Eagle speculators format model
INFO ...   Target model: meta-llama/Meta-Llama-3.1-8B-Instruct
INFO ...   Draft model: nm-testing/eagle3-llama3.1-8b-instruct-speculators
INFO ...   Method: eagle3
INFO ...   Speculative tokens: 5
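
Put differently, for the eagle3 example in the log lines above, the auto-detection is roughly equivalent to passing this configuration by hand; the keys mirror the manual --speculative-config JSON, and this is an illustration rather than a dump of the generated object:

```python
# Equivalent hand-written speculative config for the example above.
auto_built_speculative_config = {
    "method": "eagle3",
    "model": "nm-testing/eagle3-llama3.1-8b-instruct-speculators",  # draft model
    "num_speculative_tokens": 5,
}
# The served model is swapped to meta-llama/Meta-Llama-3.1-8B-Instruct,
# and the tokenizer is switched to that target model as well.
```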

Testing

Tested with various Eagle models:

  • ✅ Eagle-1 models
  • ✅ Eagle-3 models
  • ✅ HASS variants

Related Issues

This PR builds on top of vllm-project#20436 (Eagle Qwen support) and should be merged after it.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the remaining CI tests by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces auto-detection for Eagle models, simplifying the UX for speculative decoding. The changes to EngineArgs and the addition of the extract_speculators_info function are well-implemented. To improve debuggability, I've suggested adding logging to the extract_speculators_info function to surface issues with malformed model configurations.

rahul-tuli and others added 2 commits July 17, 2025 14:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@rahul-tuli
Owner Author

merged!

@rahul-tuli rahul-tuli closed this Jul 17, 2025