Add Eagle-3 Qwen support (follow-up to #20436) #21025
Conversation
- Add SpeculatorsEagleConfig to handle the speculators config format
  - Update the config loader to detect speculators Eagle models
  - Add weight-name remapping in the Eagle model's load_weights
  - Support both the standard Eagle and HASS (with layernorms) variants
  - This enables vLLM to load Eagle models converted with the speculators library's checkpoint converter, mapping config fields and weight names to vLLM's expected format.
- Remove unused Any, Dict, and Optional imports
  - Remove the unused AutoConfig import
  - Keep only Union, which is actually used in type annotations
- Use PretrainedConfig.get_config_dict() to handle both local and Hugging Face paths (see the sketch below)
  - Simplifies the code and follows best practices
  - Tested with both local paths and Hugging Face model IDs
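For context, PretrainedConfig.get_config_dict() accepts either a local checkpoint directory or a Hub model ID, which is what lets one code path serve both cases. A minimal sketch, assuming a placeholder model ID (not one from this PR):

```python
from transformers import PretrainedConfig

# Works for a local checkpoint directory or a Hugging Face Hub model ID.
# "some-org/some-eagle-model" is a placeholder, not a model from this PR.
config_dict, _ = PretrainedConfig.get_config_dict("some-org/some-eagle-model")

# The speculators format can then be recognized from the raw dict, e.g.:
is_speculators = "speculators_model_type" in config_dict
```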
- Set method='eagle' in vllm_config to ensure proper model detection
  - This field is required by the EAGLEConfig parent class
  - Helps with future V1 engine compatibility
- Change the method field in vllm_config to use speculators_config.get("speculators_model_type", "eagle")
  - Allows the method to be set dynamically based on the speculators model type
  - Maintains backward compatibility with the default value of "eagle"
- Add a check for model_type == "eagle" in the SpeculativeConfig auto-detection
  - Ensures speculators Eagle models are properly detected and method is set to "eagle"
  - Fixes the V1 engine compatibility check for speculators Eagle models
- Import the is_speculators_eagle_config function
  - Add a simple check for speculators Eagle models when method is not set
  - A minimal change that handles the speculators format as a special case
  - Fixes an issue where speculative_method was None, causing a fallback to V0
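A minimal sketch of what such a detection helper can look like, inferred from the commit messages rather than copied from the PR's source:

```python
from transformers import PretrainedConfig

# Model types handled per the commits above; the tuple itself is illustrative.
SUPPORTED_SPECULATORS_TYPES = ("eagle", "eagle3")

def is_speculators_eagle_config(model_name_or_path: str) -> bool:
    # get_config_dict resolves both local paths and Hugging Face model IDs.
    config_dict, _ = PretrainedConfig.get_config_dict(model_name_or_path)
    return config_dict.get("speculators_model_type") in SUPPORTED_SPECULATORS_TYPES
```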
- Add speculators_name_map to handle fusion_fc -> fc weight remapping
  - Also handles transformer.* -> model.layers.0.* prefix remapping
  - Fixes a KeyError for fusion_fc.weight when loading speculators Eagle models
  - Similar to the remapping already added to the eagle.py model
- Update llama_eagle.py to skip transformer weights (these are loaded separately)
  - Add num_lookahead_tokens to the speculators config (required for Eagle)
  - Together these fixes allow speculators Eagle models to work with the V1 engine
- Add documentation explaining all changes needed for speculators Eagle models
  - Details the rationale behind each modification
  - Includes common questions and answers, plus testing examples
  - Documents config translation, weight remapping, and V1 detection
- Update speculators config detection to check for the speculators_model_type key
  - Support both eagle and eagle3 in is_speculators_eagle_config
  - Handle Eagle-3-specific config fields (draft_vocab_size, target_hidden_size)
  - Infer target_hidden_size from the transformer config if not provided (see the sketch below)
  - Skip non-existent weights in llama_eagle to handle HASS models gracefully
  - Eagle-3 models don't need weight translation (they already use the correct names)
  - Enables support for nm-testing/eagle3-llama3.1-8b-instruct-speculators and nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators, while maintaining backward compatibility with Eagle-1 models
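A minimal sketch of the Eagle-3 field handling described above; the nested key names (e.g. transformer_layer_config) are assumptions based on the commit text, not the PR's exact code:

```python
def translate_eagle3_fields(speculators_cfg: dict) -> dict:
    # Eagle-3 drafts can use a reduced vocabulary and need to know the
    # target model's hidden size.
    transformer_cfg = speculators_cfg.get("transformer_layer_config", {})
    return {
        "draft_vocab_size": speculators_cfg.get("draft_vocab_size"),
        # Fall back to the draft transformer's own hidden size when
        # target_hidden_size is absent, as the commit describes.
        "target_hidden_size": speculators_cfg.get("target_hidden_size")
        or transformer_cfg.get("hidden_size"),
    }
```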
- Add an RMSNorm import and support for enorm/hnorm in llama_eagle.py (see the sketch below)
  - Apply the layernorms in the forward pass when add_para_norm is enabled
  - Handle speculators weight remapping in EagleLlamaForCausalLM.load_weights
  - Fixes HASS Eagle models (nm-testing/hass-llama3.1-8b-layernorms) in the V1 engine
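A minimal sketch of the HASS layernorm path described above, assuming RMSNorm for both norms and that the normalized embeddings and hidden states are concatenated before the fusion FC (the concatenation order is an assumption):

```python
import torch
from vllm.model_executor.layers.layernorm import RMSNorm

class HassNormSketch(torch.nn.Module):
    """Illustrative module, not the PR's EagleLlamaForCausalLM."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.enorm = RMSNorm(hidden_size, eps=eps)  # normalizes token embeddings
        self.hnorm = RMSNorm(hidden_size, eps=eps)  # normalizes target hidden states
        self.fc = torch.nn.Linear(2 * hidden_size, hidden_size, bias=False)

    def forward(self, inputs_embeds: torch.Tensor,
                hidden_states: torch.Tensor) -> torch.Tensor:
        # When add_para_norm is enabled, both streams are normalized
        # before being fused into the draft model's input.
        fused = torch.cat(
            [self.enorm(inputs_embeds), self.hnorm(hidden_states)], dim=-1)
        return self.fc(fused)
```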
- Remove the redundant model_type field from vllm_config (already defined in EAGLEConfig)
  - Extract num_lookahead_tokens from proposal_methods in the speculators config (see the sketch below)
  - Add proper assertions for the required speculators config structure
  - Remove the unnecessary intermediate variable speculators_cfg
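A minimal sketch of that extraction with assertions; the config fragment and key names below are assumptions based on the commit text:

```python
# Illustrative speculators config fragment (shape and key names assumed).
speculators_cfg = {
    "speculators_model_type": "eagle",
    "proposal_methods": [
        {"proposal_type": "greedy", "speculative_tokens": 5},
    ],
}

proposal_methods = speculators_cfg.get("proposal_methods")
assert proposal_methods, "speculators config must define proposal_methods"
first_method = proposal_methods[0]
assert "speculative_tokens" in first_method, "proposal method must set speculative_tokens"
num_lookahead_tokens = first_method["speculative_tokens"]
```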
- Remove the implementation documentation, which is no longer needed now that the implementation is complete
- Remove the V0 engine changes from eagle.py
  - Keep V1 engine support in llama_eagle.py, including layernorm support
  - Maintain config detection and translation for the speculators format
  - Ensure V1 engine compatibility for all Eagle models
  - This simplifies the implementation by focusing only on the modern V1 engine, which provides better performance and features
- Move SPECULATORS_WEIGHT_MAP to module level to eliminate duplication (see the sketch below)
  - Replace the duplicate _remap_weight_name methods with a single function
  - Fix line-continuation style to use proper parentheses
  - Streamline the weight-loading logic while preserving functionality
  - Remove verbose comments while keeping essential documentation
  - Preserve the original 'fc' naming convention
  - This consolidation improves maintainability and follows vLLM code-style conventions while preserving all existing functionality for both Eagle-1 and Eagle-3 speculators models
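A minimal sketch of the consolidated module-level mapping; entries other than fusion_fc and the transformer.* prefix rule are assumptions, and the real map in the PR may differ:

```python
SPECULATORS_WEIGHT_MAP = {
    "fusion_fc.weight": "fc.weight",
    "fusion_fc.bias": "fc.bias",
    "embedding_layernorm.weight": "enorm.weight",  # assumed HASS entry
    "pre_lm_head_layernorm.weight": "hnorm.weight",  # assumed HASS entry
}

def remap_speculators_weight_name(name: str) -> str:
    # Known one-to-one renames first.
    if name in SPECULATORS_WEIGHT_MAP:
        return SPECULATORS_WEIGHT_MAP[name]
    # The draft transformer's weights live under model.layers.0.* in vLLM's
    # single-decoder-layer Eagle draft module.
    if name.startswith("transformer."):
        return "model.layers.0." + name[len("transformer."):]
    return name
```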
- Add weight-name mapping for speculators format compatibility
  - Support the HASS variant with additional layernorms
  - Handle both Eagle-1 and Eagle-3 configurations
  - Maintain backward compatibility with existing Eagle models
  - This enables using Eagle draft models packaged with the speculators library directly in vLLM for speculative decoding
Code Review
This pull request adds support for Eagle-3 speculative decoding with Qwen models. The changes are well structured, introducing a new config format for "speculators" and updating the model loading and execution logic accordingly. A new config class, SpeculatorsEagleConfig, is added to handle the translation from the new format to vLLM's internal format, and the Qwen model is updated to support extracting the auxiliary hidden states required by Eagle-3.

I've identified one critical bug in vllm/model_executor/models/qwen2.py related to pipeline parallelism, where the layer index for collecting auxiliary hidden states is calculated incorrectly. A fix is suggested below. The rest of the changes appear to be correct and robust.
```python
if idx in self.aux_hidden_state_layers:
    aux_hidden_states.append(hidden_states + residual)
```
There's a bug in how auxiliary hidden states are collected when pipeline parallelism is used. The loop index `idx` is relative to the slice of layers for the current pipeline stage (`self.layers[self.start_layer:self.end_layer]`), so it always starts from 0. However, `self.aux_hidden_state_layers` contains absolute layer indices.

When pipeline parallelism is enabled (`pipeline_parallel_size > 1`), `self.start_layer` will be non-zero for later stages, and the condition `idx in self.aux_hidden_state_layers` will be incorrect. For example, if `self.start_layer` is 10, `idx` will start from 0, but `self.aux_hidden_state_layers` might contain values like 15; the check will fail to identify the correct layer.

To fix this, use the absolute layer index for the check: `self.start_layer + idx`.
Suggested change:

```diff
-if idx in self.aux_hidden_state_layers:
-    aux_hidden_states.append(hidden_states + residual)
+if self.start_layer + idx in self.aux_hidden_state_layers:
+    aux_hidden_states.append(hidden_states + residual)
```
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary
This PR adds support for Eagle-3 speculative decoding with Qwen models, as a follow-up to #20436.
The changes enable Qwen models to work with Eagle-3 draft models in speculators format.
Related Issues
Changes
Testing
Tested with the following verification script:
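The script itself is not captured on this page. A hypothetical sketch of such a verification run using vLLM's speculative_config API; the model IDs and token count below are placeholders, not values from the PR:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed target model
    speculative_config={
        "method": "eagle3",
        # Hypothetical draft-model ID in speculators format:
        "model": "nm-testing/eagle3-qwen2.5-7b-instruct-speculators",
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```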
The output confirms that Eagle-3 speculative decoding works correctly with Qwen models.