Add Eagle-3 Qwen support (follow-up to #20436) #21025


Closed
wants to merge 24 commits into from

Conversation

rahul-tuli
Contributor

@rahul-tuli commented on Jul 16, 2025

Summary

This PR adds support for Eagle-3 speculative decoding with Qwen models, as a follow-up to #20436.

The changes enable Qwen models to work with Eagle-3 draft models in the speculators format.

Related Issues

Changes

  • Added Qwen support to the Eagle-3 implementation
  • Added proper handling of the Qwen model architecture in Eagle-3 speculative decoding

Testing

Tested with the following verification script:

#!/usr/bin/env python3
"""
Verification script for Qwen Eagle-3 support.
"""

from vllm import LLM, SamplingParams

# Initialize vLLM with Qwen model and Eagle-3 draft model
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "nm-testing/Qwen3-8B-Eagle3-speculators-converted",
        "num_speculative_tokens": 5
    },
    max_model_len=1024,
    enforce_eager=True
)

# Test generation
outputs = llm.generate(
    ["The future of AI is"], 
    SamplingParams(max_tokens=20, temperature=0)
)

print(f"Generated: {outputs[0].outputs[0].text}")
print("✅ Eagle-3 with Qwen works\!")

The output confirms that Eagle-3 speculative decoding works correctly with Qwen models.

rahul-tuli and others added 24 commits July 15, 2025 14:11
- Add SpeculatorsEagleConfig to handle speculators config format
- Update config loader to detect speculators Eagle models
- Add weight name remapping in Eagle model load_weights
- Support both standard Eagle and HASS (with layernorms) variants

This enables vLLM to load Eagle models converted using the speculators
library's checkpoint converter, mapping config fields and weight names
to vLLM's expected format.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
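A minimal sketch of the config translation this commit describes, assuming a top-level speculators_model_type key and a nested transformer_layer_config dict (the exact speculators schema may differ):

from typing import Any


def translate_speculators_config(config_dict: dict[str, Any]) -> dict[str, Any]:
    """Translate a speculators-format config dict into Eagle-style fields."""
    # The speculators format nests the draft transformer's settings and tags
    # the model type (e.g. "eagle" or "eagle3") separately.
    transformer_cfg = config_dict.get("transformer_layer_config", {})
    return {
        "model": transformer_cfg,
        "method": config_dict.get("speculators_model_type", "eagle"),
    }


if __name__ == "__main__":
    sample = {
        "speculators_model_type": "eagle",
        "transformer_layer_config": {"hidden_size": 4096},
    }
    print(translate_speculators_config(sample))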
- Remove unused Any, Dict, Optional imports
- Remove unused AutoConfig import
- Keep only Union which is actually used in type annotations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Use PretrainedConfig.get_config_dict() to handle both local and HF paths
- Simplifies the code and follows best practices
- Tested with both local paths and HuggingFace model IDs

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
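A small sketch of the get_config_dict() pattern this commit adopts; the model ID is the one used elsewhere in this PR:

from transformers import PretrainedConfig

# get_config_dict() resolves both local directories and Hugging Face Hub IDs
# and returns (config_dict, unused_kwargs).
config_dict, _ = PretrainedConfig.get_config_dict(
    "nm-testing/Qwen3-8B-Eagle3-speculators-converted")
print(sorted(config_dict.keys()))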
- Set method='eagle' in vllm_config to ensure proper model detection
- This field is required by EAGLEConfig parent class
- Helps with future V1 engine compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Changed method field in vllm_config to use speculators_config.get("speculators_model_type", "eagle")
- This allows the method to be dynamically set based on the speculators model type
- Maintains backward compatibility with default value of "eagle"

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Added check for model_type == "eagle" in SpeculativeConfig auto-detection
- This ensures speculators Eagle models are properly detected and method is set to "eagle"
- Fixes V1 engine compatibility check for speculators Eagle models

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Import is_speculators_eagle_config function
- Add simple check for speculators Eagle models when method is not set
- Minimal change that handles speculators format as a special case
- Fixes issue where speculative_method was None causing V0 fallback

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
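A hedged sketch of what such a detection helper can look like; the real is_speculators_eagle_config lives in vLLM's config-loading code, and the accepted type values here follow the PR description:

from transformers import PretrainedConfig

SUPPORTED_SPECULATORS_TYPES = {"eagle", "eagle3"}


def is_speculators_eagle_config(model_path: str) -> bool:
    # A speculators-format Eagle config is identified by its
    # "speculators_model_type" key.
    config_dict, _ = PretrainedConfig.get_config_dict(model_path)
    return config_dict.get("speculators_model_type") in SUPPORTED_SPECULATORS_TYPES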
- Added speculators_name_map to handle fusion_fc -> fc weight remapping
- Also handles transformer.* -> model.layers.0.* prefix remapping
- Fixes KeyError for fusion_fc.weight when loading speculators Eagle models
- Similar to the remapping already added to eagle.py model

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
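A minimal sketch of the two remapping rules named in this commit (fusion_fc -> fc and transformer.* -> model.layers.0.*); anything beyond those two rules is illustrative:

SPECULATORS_NAME_MAP = {
    "fusion_fc.weight": "fc.weight",
    "fusion_fc.bias": "fc.bias",
}


def remap_weight_name(name: str) -> str:
    # fusion_fc -> fc, per the commit above.
    if name in SPECULATORS_NAME_MAP:
        return SPECULATORS_NAME_MAP[name]
    # transformer.* -> model.layers.0.*, the draft model's single decoder layer.
    if name.startswith("transformer."):
        return "model.layers.0." + name[len("transformer."):]
    return name


assert remap_weight_name("fusion_fc.weight") == "fc.weight"
assert remap_weight_name("transformer.self_attn.q_proj.weight") == (
    "model.layers.0.self_attn.q_proj.weight")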
- Updated llama_eagle.py to skip transformer weights (loaded separately)
- Added num_lookahead_tokens to speculators config (required for Eagle)
- Together these fixes allow speculators Eagle models to work with V1 engine

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Explains all changes needed for speculators Eagle models
- Details the rationale behind each modification
- Includes common questions and answers
- Provides testing examples
- Documents config translation, weight remapping, and V1 detection

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Updated speculators config detection to check for speculators_model_type key
- Support both eagle and eagle3 in is_speculators_eagle_config
- Handle Eagle-3 specific config fields (draft_vocab_size, target_hidden_size)
- Infer target_hidden_size from transformer config if not provided
- Skip non-existent weights in llama_eagle to handle HASS models gracefully
- Eagle-3 models don't need weight translation (already use correct names)

This enables support for:
- nm-testing/eagle3-llama3.1-8b-instruct-speculators
- nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators

While maintaining backward compatibility with Eagle-1 models.

Signed-off-by: rtuli@redhat.com

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
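A rough sketch of the Eagle-3 field handling this commit describes, assuming the speculators config nests the draft transformer's settings under transformer_layer_config:

from typing import Any


def eagle3_config_fields(config_dict: dict[str, Any]) -> dict[str, Any]:
    transformer_cfg = config_dict.get("transformer_layer_config", {})
    return {
        # Eagle-3 drafts can use a reduced vocabulary.
        "draft_vocab_size": config_dict.get("draft_vocab_size"),
        # Fall back to the draft transformer's hidden size when the target
        # hidden size is not given explicitly.
        "target_hidden_size": (config_dict.get("target_hidden_size")
                               or transformer_cfg.get("hidden_size")),
    }


print(eagle3_config_fields({
    "draft_vocab_size": 32000,
    "transformer_layer_config": {"hidden_size": 4096},
}))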
- Add RMSNorm import and support for enorm/hnorm in llama_eagle.py
- Apply layernorms in forward pass when add_para_norm is enabled
- Handle speculators weight remapping in EagleLlamaForCausalLM.load_weights
- Fixes HASS Eagle models (nm-testing/hass-llama3.1-8b-layernorms) in V1 engine

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
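A simplified sketch of the HASS forward path when add_para_norm is enabled: normalize the token embedding (enorm) and the incoming target hidden state (hnorm), then fuse them. It uses torch.nn.RMSNorm (PyTorch >= 2.4) for illustration rather than vLLM's RMSNorm kernel:

import torch
from torch import nn


class HassFusion(nn.Module):
    def __init__(self, hidden_size: int, add_para_norm: bool = True):
        super().__init__()
        self.add_para_norm = add_para_norm
        self.enorm = nn.RMSNorm(hidden_size)  # normalizes token embeddings
        self.hnorm = nn.RMSNorm(hidden_size)  # normalizes target hidden states
        self.fc = nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def forward(self, embeds: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        if self.add_para_norm:
            embeds = self.enorm(embeds)
            hidden = self.hnorm(hidden)
        # Concatenate and project back down to the model's hidden size.
        return self.fc(torch.cat([embeds, hidden], dim=-1))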
- Remove redundant model_type field from vllm_config (already defined in EAGLEConfig)
- Extract num_lookahead_tokens from proposal_methods in speculators config
- Add proper assertions for required speculators config structure
- Remove unnecessary intermediate variable speculators_cfg

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
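A sketch of the num_lookahead_tokens extraction this commit describes; the proposal_methods and speculative_tokens key names are assumptions based on the commit text:

from typing import Any


def get_num_lookahead_tokens(speculators_cfg: dict[str, Any]) -> int:
    # Pull the lookahead count from the first proposal method, asserting the
    # structure the commit message requires.
    proposal_methods = speculators_cfg.get("proposal_methods")
    assert proposal_methods, "speculators config must define proposal_methods"
    first_method = proposal_methods[0]
    assert "speculative_tokens" in first_method
    return first_method["speculative_tokens"]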
This documentation is no longer needed as the implementation is complete.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
- Remove V0 engine changes from eagle.py
- Keep V1 engine support in llama_eagle.py with layernorm support
- Maintain config detection and translation for speculators format
- Ensure V1 engine compatibility for all Eagle models

This simplifies the implementation by focusing only on the modern V1 engine
which provides better performance and features.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move SPECULATORS_WEIGHT_MAP to module level to eliminate duplication
- Replace duplicate _remap_weight_name methods with single function
- Fix line continuation style to use proper parentheses
- Streamline weight loading logic while preserving functionality
- Remove verbose comments while keeping essential documentation
- Preserve original 'fc' naming convention

This consolidation improves maintainability and follows vLLM
code style conventions while preserving all existing functionality
for both Eagle-1 and Eagle-3 speculators models.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: rtuli@redhat.com
Co-Authored-By: Claude <noreply@anthropic.com>
- Add weight name mapping for speculators format compatibility
- Support HASS variant with additional layernorms
- Handle both Eagle-1 and Eagle-3 configurations
- Maintain backward compatibility with existing Eagle models

This enables using Eagle draft models packaged with the speculators
library directly in vLLM for speculative decoding.
@mergify bot added the llama (Related to Llama models), qwen (Related to Qwen models), and speculative-decoding labels on Jul 16, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Eagle-3 speculative decoding with Qwen models. The changes are well-structured, introducing support for the "speculators" config format and updating the model loading and execution logic accordingly. A new config class, SpeculatorsEagleConfig, handles the translation from the speculators format to vLLM's internal format, and the Qwen model is updated to support extracting the auxiliary hidden states required by Eagle-3.

I've identified one critical bug in vllm/model_executor/models/qwen2.py related to pipeline parallelism, where the layer index for collecting auxiliary hidden states is calculated incorrectly. A fix has been suggested. The rest of the changes appear to be correct and robust.

Comment on lines +358 to +359
if idx in self.aux_hidden_state_layers:
aux_hidden_states.append(hidden_states + residual)

critical

There's a bug in how auxiliary hidden states are collected when pipeline parallelism is used. The loop index idx is relative to the slice of layers for the current pipeline stage (self.layers[self.start_layer:self.end_layer]), so it will always start from 0. However, self.aux_hidden_state_layers contains absolute layer indices.

When pipeline parallelism is enabled (pipeline_parallel_size > 1), self.start_layer will be non-zero for later stages, and the condition idx in self.aux_hidden_state_layers will be incorrect. For example, if self.start_layer is 10, idx will start from 0, but self.aux_hidden_state_layers might contain values like 15. The check will fail to identify the correct layer.

To fix this, you should use the absolute layer index for the check. The absolute index is self.start_layer + idx.

Suggested change:

-    if idx in self.aux_hidden_state_layers:
-        aux_hidden_states.append(hidden_states + residual)
+    if self.start_layer + idx in self.aux_hidden_state_layers:
+        aux_hidden_states.append(hidden_states + residual)
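A small, runnable toy illustrating the indexing fix: aux_hidden_state_layers holds absolute layer indices, while enumerate() over the stage's layer slice starts at 0, so the absolute index is start_layer + idx:

start_layer, end_layer = 10, 20          # this pipeline stage owns layers 10..19
aux_hidden_state_layers = {15, 18}       # absolute layer indices to collect
layers = list(range(32))                 # stand-ins for decoder layers

collected = []
for idx, layer in enumerate(layers[start_layer:end_layer]):
    if start_layer + idx in aux_hidden_state_layers:   # corrected check
        collected.append(layer)

assert collected == [15, 18]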

@rahul-tuli rahul-tuli marked this pull request as draft July 16, 2025 02:57
@rahul-tuli rahul-tuli closed this Jul 16, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀
