Add Eagle-3 Qwen support (follow-up to #20436) #21025


Closed
wants to merge 24 commits into from

Conversation

rahul-tuli
Contributor

@rahul-tuli commented on Jul 16, 2025

Summary

This PR adds support for Eagle-3 speculative decoding with Qwen models, as a follow-up to #20436.

The changes enable Qwen models to work with Eagle-3 draft models in the speculators format.

Related Issues

Changes

  • Added Qwen support to the Eagle-3 implementation
  • Added proper handling of the Qwen model architecture in Eagle-3 speculative decoding

Testing

Tested with the following verification script:

#!/usr/bin/env python3
"""
Verification script for Qwen Eagle-3 support.
"""

from vllm import LLM, SamplingParams

# Initialize vLLM with Qwen model and Eagle-3 draft model
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "nm-testing/Qwen3-8B-Eagle3-speculators-converted",
        "num_speculative_tokens": 5
    },
    max_model_len=1024,
    enforce_eager=True
)

# Test generation
outputs = llm.generate(
    ["The future of AI is"], 
    SamplingParams(max_tokens=20, temperature=0)
)

print(f"Generated: {outputs[0].outputs[0].text}")
print("✅ Eagle-3 with Qwen works\!")

The output confirms that Eagle-3 speculative decoding works correctly with Qwen models.

rahul-tuli and others added 24 commits July 15, 2025 14:11
- Add SpeculatorsEagleConfig to handle speculators config format
- Update config loader to detect speculators Eagle models
- Add weight name remapping in Eagle model load_weights
- Support both standard Eagle and HASS (with layernorms) variants

This enables vLLM to load Eagle models converted using the speculators
library's checkpoint converter, mapping config fields and weight names
to vLLM's expected format.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
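A minimal sketch of the config translation this commit describes, assuming a top-level speculators_model_type key and a nested transformer_layer_config dict (the exact speculators schema may differ):

from typing import Any


def translate_speculators_config(config_dict: dict[str, Any]) -> dict[str, Any]:
    """Translate a speculators-format config dict into Eagle-style fields."""
    # The speculators format nests the draft transformer's settings and tags
    # the model type (e.g. "eagle" or "eagle3") separately.
    transformer_cfg = config_dict.get("transformer_layer_config", {})
    return {
        "model": transformer_cfg,
        "method": config_dict.get("speculators_model_type", "eagle"),
    }


if __name__ == "__main__":
    sample = {
        "speculators_model_type": "eagle",
        "transformer_layer_config": {"hidden_size": 4096},
    }
    print(translate_speculators_config(sample))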
- Remove unused Any, Dict, Optional imports
- Remove unused AutoConfig import
- Keep only Union which is actually used in type annotations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Use PretrainedConfig.get_config_dict() to handle both local and HF paths
- Simplifies the code and follows best practices
- Tested with both local paths and HuggingFace model IDs

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
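A small sketch of the get_config_dict() pattern this commit adopts; the model ID is the one used elsewhere in this PR:

from transformers import PretrainedConfig

# get_config_dict() resolves both local directories and Hugging Face Hub IDs
# and returns (config_dict, unused_kwargs).
config_dict, _ = PretrainedConfig.get_config_dict(
    "nm-testing/Qwen3-8B-Eagle3-speculators-converted")
print(sorted(config_dict.keys()))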
- Set method='eagle' in vllm_config to ensure proper model detection
- This field is required by EAGLEConfig parent class
- Helps with future V1 engine compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Changed method field in vllm_config to use speculators_config.get("speculators_model_type", "eagle")
- This allows the method to be dynamically set based on the speculators model type
- Maintains backward compatibility with default value of "eagle"

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Added check for model_type == "eagle" in SpeculativeConfig auto-detection
- This ensures speculators Eagle models are properly detected and method is set to "eagle"
- Fixes V1 engine compatibility check for speculators Eagle models

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Import is_speculators_eagle_config function
- Add simple check for speculators Eagle models when method is not set
- Minimal change that handles speculators format as a special case
- Fixes issue where speculative_method was None causing V0 fallback

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
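A hedged sketch of what such a detection helper can look like; the real is_speculators_eagle_config lives in vLLM's config-loading code, and the accepted type values here follow the PR description:

from transformers import PretrainedConfig

SUPPORTED_SPECULATORS_TYPES = {"eagle", "eagle3"}


def is_speculators_eagle_config(model_path: str) -> bool:
    # A speculators-format Eagle config is identified by its
    # "speculators_model_type" key.
    config_dict, _ = PretrainedConfig.get_config_dict(model_path)
    return config_dict.get("speculators_model_type") in SUPPORTED_SPECULATORS_TYPES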
- Added speculators_name_map to handle fusion_fc -> fc weight remapping
- Also handles transformer.* -> model.layers.0.* prefix remapping
- Fixes KeyError for fusion_fc.weight when loading speculators Eagle models
- Similar to the remapping already added to eagle.py model

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
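A minimal sketch of the two remapping rules named in this commit (fusion_fc -> fc and transformer.* -> model.layers.0.*); anything beyond those two rules is illustrative:

SPECULATORS_NAME_MAP = {
    "fusion_fc.weight": "fc.weight",
    "fusion_fc.bias": "fc.bias",
}


def remap_weight_name(name: str) -> str:
    # fusion_fc -> fc, per the commit above.
    if name in SPECULATORS_NAME_MAP:
        return SPECULATORS_NAME_MAP[name]
    # transformer.* -> model.layers.0.*, the draft model's single decoder layer.
    if name.startswith("transformer."):
        return "model.layers.0." + name[len("transformer."):]
    return name


assert remap_weight_name("fusion_fc.weight") == "fc.weight"
assert remap_weight_name("transformer.self_attn.q_proj.weight") == (
    "model.layers.0.self_attn.q_proj.weight")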
- Updated llama_eagle.py to skip transformer weights (loaded separately)
- Added num_lookahead_tokens to speculators config (required for Eagle)
- Together these fixes allow speculators Eagle models to work with V1 engine

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Explains all changes needed for speculators Eagle models
- Details the rationale behind each modification
- Includes common questions and answers
- Provides testing examples
- Documents config translation, weight remapping, and V1 detection

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Updated speculators config detection to check for speculators_model_type key
- Support both eagle and eagle3 in is_speculators_eagle_config
- Handle Eagle-3 specific config fields (draft_vocab_size, target_hidden_size)
- Infer target_hidden_size from transformer config if not provided
- Skip non-existent weights in llama_eagle to handle HASS models gracefully
- Eagle-3 models don't need weight translation (already use correct names)

This enables support for:
- nm-testing/eagle3-llama3.1-8b-instruct-speculators
- nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators

While maintaining backward compatibility with Eagle-1 models.

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
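A rough sketch of the Eagle-3 field handling this commit describes, assuming the speculators config nests the draft transformer's settings under transformer_layer_config:

from typing import Any


def eagle3_config_fields(config_dict: dict[str, Any]) -> dict[str, Any]:
    transformer_cfg = config_dict.get("transformer_layer_config", {})
    return {
        # Eagle-3 drafts can use a reduced vocabulary.
        "draft_vocab_size": config_dict.get("draft_vocab_size"),
        # Fall back to the draft transformer's hidden size when the target
        # hidden size is not given explicitly.
        "target_hidden_size": (config_dict.get("target_hidden_size")
                               or transformer_cfg.get("hidden_size")),
    }


print(eagle3_config_fields({
    "draft_vocab_size": 32000,
    "transformer_layer_config": {"hidden_size": 4096},
}))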
- Add RMSNorm import and support for enorm/hnorm in llama_eagle.py
- Apply layernorms in forward pass when add_para_norm is enabled
- Handle speculators weight remapping in EagleLlamaForCausalLM.load_weights
- Fixes HASS Eagle models (nm-testing/hass-llama3.1-8b-layernorms) in V1 engine

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
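A simplified sketch of the HASS forward path when add_para_norm is enabled: normalize the token embedding (enorm) and the incoming target hidden state (hnorm), then fuse them. It uses torch.nn.RMSNorm (PyTorch >= 2.4) for illustration rather than vLLM's RMSNorm kernel:

import torch
from torch import nn


class HassFusion(nn.Module):
    def __init__(self, hidden_size: int, add_para_norm: bool = True):
        super().__init__()
        self.add_para_norm = add_para_norm
        self.enorm = nn.RMSNorm(hidden_size)  # normalizes token embeddings
        self.hnorm = nn.RMSNorm(hidden_size)  # normalizes target hidden states
        self.fc = nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def forward(self, embeds: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        if self.add_para_norm:
            embeds = self.enorm(embeds)
            hidden = self.hnorm(hidden)
        # Concatenate and project back down to the model's hidden size.
        return self.fc(torch.cat([embeds, hidden], dim=-1))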
- Remove redundant model_type field from vllm_config (already defined in EAGLEConfig)
- Extract num_lookahead_tokens from proposal_methods in speculators config
- Add proper assertions for required speculators config structure
- Remove unnecessary intermediate variable speculators_cfg

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
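A sketch of the num_lookahead_tokens extraction this commit describes; the proposal_methods and speculative_tokens key names are assumptions based on the commit text:

from typing import Any


def get_num_lookahead_tokens(speculators_cfg: dict[str, Any]) -> int:
    # Pull the lookahead count from the first proposal method, asserting the
    # structure the commit message requires.
    proposal_methods = speculators_cfg.get("proposal_methods")
    assert proposal_methods, "speculators config must define proposal_methods"
    first_method = proposal_methods[0]
    assert "speculative_tokens" in first_method
    return first_method["speculative_tokens"]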
This documentation is no longer needed as the implementation is complete.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Remove V0 engine changes from eagle.py
- Keep V1 engine support in llama_eagle.py with layernorm support
- Maintain config detection and translation for speculators format
- Ensure V1 engine compatibility for all Eagle models

This simplifies the implementation by focusing only on the modern V1 engine
which provides better performance and features.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move SPECULATORS_WEIGHT_MAP to module level to eliminate duplication
- Replace duplicate _remap_weight_name methods with single function
- Fix line continuation style to use proper parentheses
- Streamline weight loading logic while preserving functionality
- Remove verbose comments while keeping essential documentation
- Preserve original 'fc' naming convention

This consolidation improves maintainability and follows vLLM
code style conventions while preserving all existing functionality
for both Eagle-1 and Eagle-3 speculators models.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: rtuli@redhat.com
Co-Authored-By: Claude <noreply@anthropic.com>
- Add weight name mapping for speculators format compatibility
- Support HASS variant with additional layernorms
- Handle both Eagle-1 and Eagle-3 configurations
- Maintain backward compatibility with existing Eagle models

This enables using Eagle draft models packaged with the speculators
library directly in vLLM for speculative decoding.
@mergify bot added the llama (Related to Llama models), qwen (Related to Qwen models), and speculative-decoding labels on Jul 16, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Eagle-3 speculative decoding with Qwen models. The changes are well-structured, introducing support for the "speculators" config format and updating the model loading and execution logic accordingly. A new config class, SpeculatorsEagleConfig, handles the translation from the speculators format to vLLM's internal format, and the Qwen model is updated to support extracting the auxiliary hidden states required by Eagle-3.

I've identified one critical bug in vllm/model_executor/models/qwen2.py related to pipeline parallelism, where the layer index for collecting auxiliary hidden states is calculated incorrectly. A fix has been suggested. The rest of the changes appear to be correct and robust.

Comment on lines +358 to +359
if idx in self.aux_hidden_state_layers:
aux_hidden_states.append(hidden_states + residual)

critical

There's a bug in how auxiliary hidden states are collected when pipeline parallelism is used. The loop index idx is relative to the slice of layers for the current pipeline stage (self.layers[self.start_layer:self.end_layer]), so it will always start from 0. However, self.aux_hidden_state_layers contains absolute layer indices.

When pipeline parallelism is enabled (pipeline_parallel_size > 1), self.start_layer will be non-zero for later stages, and the condition idx in self.aux_hidden_state_layers will be incorrect. For example, if self.start_layer is 10, idx will start from 0, but self.aux_hidden_state_layers might contain values like 15. The check will fail to identify the correct layer.

To fix this, you should use the absolute layer index for the check. The absolute index is self.start_layer + idx.

Suggested change:

-    if idx in self.aux_hidden_state_layers:
-        aux_hidden_states.append(hidden_states + residual)
+    if self.start_layer + idx in self.aux_hidden_state_layers:
+        aux_hidden_states.append(hidden_states + residual)
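A small, runnable toy illustrating the indexing fix: aux_hidden_state_layers holds absolute layer indices, while enumerate() over the stage's layer slice starts at 0, so the absolute index is start_layer + idx:

start_layer, end_layer = 10, 20          # this pipeline stage owns layers 10..19
aux_hidden_state_layers = {15, 18}       # absolute layer indices to collect
layers = list(range(32))                 # stand-ins for decoder layers

collected = []
for idx, layer in enumerate(layers[start_layer:end_layer]):
    if start_layer + idx in aux_hidden_state_layers:   # corrected check
        collected.append(layer)

assert collected == [15, 18]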

@rahul-tuli rahul-tuli marked this pull request as draft July 16, 2025 02:57
@rahul-tuli rahul-tuli closed this Jul 16, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀
