feat: Add support for speculators Eagle checkpoints #20436

Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces compatibility for Eagle models that have been converted using the `speculators` library. It establishes a robust translation layer within vLLM to seamlessly handle differences in configuration and weight naming conventions between the `speculators` ecosystem and vLLM's native Eagle implementation, enabling speculative decoding with these models.

Highlights

- Support for Speculators Eagle Models: This pull request enables vLLM to load and utilize Eagle models that have been converted using the `speculators` library. This bridges the compatibility gap between the `speculators` ecosystem and vLLM's native Eagle implementation, allowing for speculative decoding with these models.
- Configuration Translation Layer: A new `SpeculatorsEagleConfig` class has been introduced. This class automatically detects `speculators`-specific configuration formats (e.g., `speculators_model_type`, `transformer_layer_config`, `fusion_bias`, `layernorms`) and translates them into vLLM's expected Eagle configuration format during model loading.
- Automated Weight Remapping: Logic has been added to the `EAGLE.load_weights()` method to transparently remap weight names. This handles the differences between `speculators`' naming conventions (e.g., `fusion_fc.weight`, `transformer.`) and vLLM's internal naming scheme, ensuring that weights are loaded correctly without manual intervention.
- Seamless Integration and Backward Compatibility: The `get_config()` utility has been updated to automatically detect `speculators` Eagle models based on their configuration. These models are then routed through the new translation layer. This integration is designed to be additive, ensuring full backward compatibility with existing vLLM Eagle models.
Code Review

This pull request adds support for loading Eagle models from the `speculators` library by introducing a configuration adapter and weight remapping. The overall approach is sound. The review focuses on the new `SpeculatorsEagleConfig` implementation and identifies critical issues related to handling remote models from the Hugging Face Hub, which would prevent the feature from working in a common use case. Detailed suggestions are provided to fix these issues by properly fetching remote configurations, along with a minor suggestion to improve code maintainability.
Can you post steps to run?
I wasn't able to run the verification script on this branch with speculators main.
```diff
@@ -334,6 +336,17 @@ def get_config(
         raise ValueError(error_message) from e

+    if config_format == ConfigFormat.HF:
+        # Check if this is a speculators Eagle model
+        if is_speculators_eagle_config(model):
+            config = SpeculatorsEagleConfig.from_pretrained(
```
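For context, a rough sketch of what this detection helper might do, based on the commit notes in this PR (the real implementation lives in `vllm/transformers_utils/configs/speculators_eagle.py`; this sketch is an assumption, not the PR's actual code):

```python
from transformers import PretrainedConfig

def is_speculators_eagle_config(model: str) -> bool:
    """Sketch: detect a speculators-format Eagle checkpoint.

    Assumes detection keys off the speculators_model_type field,
    per the commit notes; "eagle3" support is mentioned in a later commit.
    """
    # get_config_dict resolves both local paths and Hugging Face Hub IDs
    config_dict, _ = PretrainedConfig.get_config_dict(model)
    return config_dict.get("speculators_model_type") in ("eagle", "eagle3")
```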
Are all existing supported models just going through the PretrainedConfig pathway?
Yes!
We don't need speculators to run the models; here are the steps:

Step 1: Convert an existing Eagle or HASS checkpoint with the speculators convert utility from this branch (yet to land on main): neuralmagic/speculators#39. There is a doc explaining how to use the convert utility here: https://github.com/neuralmagic/speculators/blob/efab1758d803e03f42c85cc67425cefa80c5344f/docs/convert.md

For example, convert an existing Eagle checkpoint using:

```
speculators convert --eagle yuhuili/EAGLE-LLaMA3.1-Instruct-8B ./converted/eagle meta-llama/Llama-3.1-8B-Instruct
```

The converted checkpoint will be saved in ./converted/eagle.

Step 2: Check out the current branch in vLLM, and run the model:

```python
from vllm import LLM, SamplingParams

# UPDATE THIS PATH
eagle_model_path = "/home/rahul/speculators/converted/eagle"

print("Loading models...")

# Create LLM with Eagle speculative decoding
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # target/verifier model
    speculative_config={
        "model": eagle_model_path,  # Your Eagle model path
        "num_speculative_tokens": 5,  # Number of tokens to predict ahead
    },
    trust_remote_code=True,
    gpu_memory_utilization=0.4,
    max_model_len=1024,
)

print("Models loaded! Generating text...")

# Your prompt
prompt = "The benefits of open source software include"

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=100,
)

# Generate text
output = llm.generate([prompt], sampling_params)[0]
generated_text = output.outputs[0].text

print(f"\nPrompt: {prompt}")
print(f"Generated: {generated_text}")
```

Output:

```
...(truncated for brevity)...
Models loaded! Generating text...
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 74.57it/s]
Processed prompts: 100%|██████████████████████████| 1/1 [00:02<00:00, 2.42s/it, est. speed input: 3.31 toks/s, output: 41.41 toks/s]

Prompt: The benefits of open source software include
Generated: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.
2. Customization: Open source software can be modified and customized to meet specific needs, which can be particularly useful for businesses or organizations with unique requirements.
3. Community support: Open source software often has a large and active community of developers and users who contribute to its development and provide support.
4. Security: Open source software can be more secure

[rank0]:[W704 17:23:14.420996530 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```
The CLI does not seem to be working on that branch.

Could you paste the command you used, and the traceback, on that PR?

I'll touch base offline. It just didn't recognize the speculators command.
```python
        transformer_config["architectures"] = [arch]

        # Build vLLM config
        vllm_config = {
```
Why don't we need to add the verifier model as part of the config? How are the two differentiated in the vllm_config object?
This pull request has merge conflicts that must be resolved before it can be merged.
```diff
@@ -22,6 +23,23 @@

 logger = init_logger(__name__)

+# Weight name mapping for speculators format compatibility
+SPECULATORS_WEIGHT_MAP = {
+    "fusion_fc.weight": "fc.weight",
```
I thought we removed this translation in speculators?
```python
DEFAULT_NUM_LOOKAHEAD_TOKENS = 5


class SpeculatorsEagleConfig(EAGLEConfig):
```
Why did we change this from SpeculatorsConfig?
Commits

- Add SpeculatorsEagleConfig to handle the speculators config format; update the config loader to detect speculators Eagle models; add weight name remapping in the Eagle model's load_weights; support both standard Eagle and HASS (with layernorms) variants. This enables vLLM to load Eagle models converted using the speculators library's checkpoint converter, mapping config fields and weight names to vLLM's expected format.
- Remove unused Any, Dict, Optional, and AutoConfig imports; keep only Union, which is actually used in type annotations.
- Use PretrainedConfig.get_config_dict() to handle both local and HF paths; simplifies the code and follows best practices; tested with both local paths and HuggingFace model IDs.
- Set method='eagle' in vllm_config to ensure proper model detection; this field is required by the EAGLEConfig parent class and helps with future V1 engine compatibility.
- Change the method field in vllm_config to use speculators_config.get("speculators_model_type", "eagle"), so the method is set dynamically based on the speculators model type while maintaining backward compatibility with the default value of "eagle".
- Add a check for model_type == "eagle" in SpeculativeConfig auto-detection, ensuring speculators Eagle models are properly detected and method is set to "eagle"; fixes the V1 engine compatibility check for speculators Eagle models.
- Import is_speculators_eagle_config and add a simple check for speculators Eagle models when method is not set; a minimal change that handles the speculators format as a special case and fixes an issue where speculative_method was None, causing V0 fallback.
- Add speculators_name_map to handle fusion_fc -> fc weight remapping, plus transformer.* -> model.layers.0.* prefix remapping; fixes a KeyError for fusion_fc.weight when loading speculators Eagle models, similar to the remapping already added to the eagle.py model.
- Update llama_eagle.py to skip transformer weights (loaded separately) and add num_lookahead_tokens to the speculators config (required for Eagle); together these fixes allow speculators Eagle models to work with the V1 engine.
- Add a document explaining all changes needed for speculators Eagle models: the rationale behind each modification, common questions and answers, testing examples, and coverage of config translation, weight remapping, and V1 detection.
- Update speculators config detection to check for the speculators_model_type key; support both eagle and eagle3 in is_speculators_eagle_config; handle Eagle-3-specific config fields (draft_vocab_size, target_hidden_size); infer target_hidden_size from the transformer config if not provided; skip non-existent weights in llama_eagle to handle HASS models gracefully. Eagle-3 models don't need weight translation (they already use correct names). This enables support for nm-testing/eagle3-llama3.1-8b-instruct-speculators and nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators while maintaining backward compatibility with Eagle-1 models.
- Add an RMSNorm import and support for enorm/hnorm in llama_eagle.py; apply layernorms in the forward pass when add_para_norm is enabled; handle speculators weight remapping in EagleLlamaForCausalLM.load_weights. Fixes HASS Eagle models (nm-testing/hass-llama3.1-8b-layernorms) in the V1 engine.
- Remove the redundant model_type field from vllm_config (already defined in EAGLEConfig); extract num_lookahead_tokens from proposal_methods in the speculators config; add proper assertions for the required speculators config structure; remove the unnecessary intermediate variable speculators_cfg.
- Remove the implementation documentation, which is no longer needed now that the implementation is complete.
- Remove V0 engine changes from eagle.py; keep V1 engine support in llama_eagle.py with layernorm support; maintain config detection and translation for the speculators format; ensure V1 engine compatibility for all Eagle models. This simplifies the implementation by focusing only on the modern V1 engine, which provides better performance and features.
- Move SPECULATORS_WEIGHT_MAP to module level to eliminate duplication; replace duplicate _remap_weight_name methods with a single function; fix line-continuation style to use proper parentheses; streamline the weight-loading logic while preserving functionality; remove verbose comments while keeping essential documentation; preserve the original 'fc' naming convention. This consolidation improves maintainability and follows vLLM code-style conventions while preserving all existing functionality for both Eagle-1 and Eagle-3 speculators models.
- Add weight name mapping for speculators format compatibility; support the HASS variant with additional layernorms; handle both Eagle-1 and Eagle-3 configurations; maintain backward compatibility with existing Eagle models. This enables using Eagle draft models packaged with the speculators library directly in vLLM for speculative decoding.
```diff
@@ -159,6 +204,11 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):

         model_weights = {}
+        for name, loaded_weight in weights:
+            remapped_name = remap_speculators_weight_name(name)
```
should this be under a check where we first check if there's a speculators config present in self.config?
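Hypothetically, the guard being suggested might look like the sketch below; the `speculators_config` attribute is an assumed name for illustration, not a field the PR necessarily defines:

```python
# Sketch of the suggested gating: only remap names for checkpoints that
# actually came from speculators. `speculators_config` is hypothetical.
is_speculators = getattr(self.config, "speculators_config", None) is not None

model_weights = {}
for name, loaded_weight in weights:
    if is_speculators:
        name = remap_speculators_weight_name(name)
    model_weights[name] = loaded_weight
```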
Summary
This PR adds support for loading Eagle models converted using the speculators library's checkpoint converter. This enables vLLM to use speculators-format Eagle models for speculative decoding, bridging the gap between the speculators ecosystem and vLLM.
Technical Details

Problem Statement

The speculators library provides a unified framework for speculative decoding models, but uses a different configuration and weight naming convention than vLLM's native Eagle implementation. Key differences include:

Configuration Format:

- speculators: `{"speculators_model_type": "eagle", "transformer_layer_config": {...}, "fusion_bias": bool, "layernorms": bool}`
- vLLM: `{"model_type": "eagle", "model": {...}, "eagle_fc_bias": bool, "model.add_para_norm": bool}`

Weight Naming:

- speculators: `fusion_fc.weight`, `embedding_layernorm.weight`
- vLLM: `fc.weight`, `enorm.weight`
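For illustration, here is a minimal sketch of this translation, assuming only the field mappings listed above; the helper name, nesting of `add_para_norm` under the model sub-config, and the defaults are assumptions rather than the PR's actual code:

```python
def translate_speculators_config(spec_cfg: dict) -> dict:
    """Hypothetical helper: map a speculators Eagle config to vLLM's format.

    Field names follow the comparison above; defaults are assumptions.
    """
    # speculators nests the draft transformer's config here
    model_cfg = dict(spec_cfg["transformer_layer_config"])
    # HASS checkpoints set "layernorms": true, mapped to add_para_norm
    model_cfg["add_para_norm"] = spec_cfg.get("layernorms", False)
    return {
        "model_type": "eagle",
        "model": model_cfg,
        "eagle_fc_bias": spec_cfg.get("fusion_bias", False),
    }
```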
Solution Architecture

This implementation provides a translation layer that consists of:

1. Config Adapter (`SpeculatorsEagleConfig`): detects the `speculators_model_type` field and translates speculators configs into vLLM's Eagle format during loading.
2. Weight Remapping: integrated into `EAGLE.load_weights()` for transparent operation.
3. Automatic Detection: `get_config()` is updated to detect and route speculators configs.

Implementation Details

Files Modified

- `vllm/transformers_utils/configs/speculators_eagle.py` (new): the config adapter.
- `vllm/model_executor/models/eagle.py`: weight name remapping in load_weights.
- `vllm/transformers_utils/config.py`: routes to `SpeculatorsEagleConfig` when detected.

Testing
Models Tested

1. Standard Eagle (without layernorms): `nm-testing/eagle-llama3.1-8b-instruct`, config `{"layernorms": false, "fusion_bias": false}`
2. HASS variant (with layernorms): `nm-testing/hass-llama3.1-8b-layernorms`, config `{"layernorms": true, "fusion_bias": false}`, with additional `embedding_layernorm` and `pre_lm_head_layernorm` weights

Test Results

Both models load successfully and generate text with speculative decoding.

Verification Script

Verification Output
Future Work

Phase 1: Direct Model Serving (Future PR)

Enable `vllm serve <speculators_model>` by automatically configuring speculative decoding. Implementation approach:

- a `SpeculatorsModelConfig` that reads verifier info from the speculators config
- update `EngineArgs` to auto-configure when detecting speculators models

Phase 2: Extended Speculators Support

Phase 3: Ecosystem Integration
Design Decisions

- Weight remapping lives in `load_weights()` for transparency and minimal code changes.
- Detection uses the `speculators_model_type` field as the definitive indicator.
- HASS support keys off the `layernorms` field, mapped to vLLM's `add_para_norm`.

Performance Impact
Dependencies
No new dependencies required. Uses existing vLLM and transformers infrastructure.
Checklist
References
This is a draft PR for initial feedback.