feat: Add support for speculators Eagle checkpoints #20436

Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes

Hello @rahul-tuli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces compatibility for Eagle models that have been converted using the `speculators` library. It establishes a robust translation layer within vLLM to seamlessly handle differences in configuration and weight naming conventions between the `speculators` ecosystem and vLLM's native Eagle implementation, enabling speculative decoding with these models.

Highlights

- Support for Speculators Eagle Models: This pull request enables vLLM to load and utilize Eagle models that have been converted using the `speculators` library. This bridges the compatibility gap between the `speculators` ecosystem and vLLM's native Eagle implementation, allowing for speculative decoding with these models.
- Configuration Translation Layer: A new `SpeculatorsEagleConfig` class has been introduced. This class automatically detects `speculators`-specific configuration formats (e.g., `speculators_model_type`, `transformer_layer_config`, `fusion_bias`, `layernorms`) and translates them into vLLM's expected Eagle configuration format during model loading.
- Automated Weight Remapping: Logic has been added to the `EAGLE.load_weights()` method to transparently remap weight names. This handles the differences between `speculators`' naming conventions (e.g., `fusion_fc.weight`, `transformer.`) and vLLM's internal naming scheme, ensuring that weights are loaded correctly without manual intervention.
- Seamless Integration and Backward Compatibility: The `get_config()` utility has been updated to automatically detect `speculators` Eagle models based on their configuration. These models are then routed through the new translation layer. This integration is designed to be additive, ensuring full backward compatibility with existing vLLM Eagle models.
Code Review

This pull request adds support for loading Eagle models from the `speculators` library by introducing a configuration adapter and weight remapping. The overall approach is sound. The review focuses on the new `SpeculatorsEagleConfig` implementation and identifies critical issues related to handling remote models from the Hugging Face Hub, which would prevent the feature from working in a common use case. Detailed suggestions are provided to fix these issues by properly fetching remote configurations, along with a minor suggestion to improve code maintainability.
Can you post steps to run?
I wasn't able to run the verification script on this branch with speculators main.
```diff
@@ -334,6 +336,17 @@ def get_config(
         raise ValueError(error_message) from e

+    if config_format == ConfigFormat.HF:
+        # Check if this is a speculators Eagle model
+        if is_speculators_eagle_config(model):
+            config = SpeculatorsEagleConfig.from_pretrained(
```
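For context, a rough sketch of what this detection helper might do, based on the commit notes in this PR (the real implementation lives in `vllm/transformers_utils/configs/speculators_eagle.py`; this sketch is an assumption, not the PR's actual code):

```python
from transformers import PretrainedConfig

def is_speculators_eagle_config(model: str) -> bool:
    """Sketch: detect a speculators-format Eagle checkpoint.

    Assumes detection keys off the speculators_model_type field,
    per the commit notes; "eagle3" support is mentioned in a later commit.
    """
    # get_config_dict resolves both local paths and Hugging Face Hub IDs
    config_dict, _ = PretrainedConfig.get_config_dict(model)
    return config_dict.get("speculators_model_type") in ("eagle", "eagle3")
```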
Are all existing supported models just going through the PretrainedConfig pathway?
Yes!
We don't need speculators to run the models; here are the steps:

Step 1: Convert an existing Eagle or HASS checkpoint with the speculators convert utility from this branch (yet to land on main): neuralmagic/speculators#39. There is a doc explaining how to use the convert utility here: https://github.com/neuralmagic/speculators/blob/efab1758d803e03f42c85cc67425cefa80c5344f/docs/convert.md

For example, convert an existing Eagle checkpoint using:

```
speculators convert --eagle yuhuili/EAGLE-LLaMA3.1-Instruct-8B ./converted/eagle meta-llama/Llama-3.1-8B-Instruct
```

The converted checkpoint will be saved in ./converted/eagle.

Step 2: Check out the current branch in vLLM, and run the model:

```python
from vllm import LLM, SamplingParams

# UPDATE THIS PATH
eagle_model_path = "/home/rahul/speculators/converted/eagle"

print("Loading models...")

# Create LLM with Eagle speculative decoding
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # target/verifier model
    speculative_config={
        "model": eagle_model_path,  # Your Eagle model path
        "num_speculative_tokens": 5,  # Number of tokens to predict ahead
    },
    trust_remote_code=True,
    gpu_memory_utilization=0.4,
    max_model_len=1024,
)

print("Models loaded! Generating text...")

# Your prompt
prompt = "The benefits of open source software include"

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=100,
)

# Generate text
output = llm.generate([prompt], sampling_params)[0]
generated_text = output.outputs[0].text

print(f"\nPrompt: {prompt}")
print(f"Generated: {generated_text}")
```

Output:

```
...(truncated for brevity)...
Models loaded! Generating text...
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 74.57it/s]
Processed prompts: 100%|██████████████████████████| 1/1 [00:02<00:00, 2.42s/it, est. speed input: 3.31 toks/s, output: 41.41 toks/s]

Prompt: The benefits of open source software include
Generated: :
1. Cost savings: Open source software is often free or low-cost, which can be a significant advantage for individuals and organizations with limited budgets.
2. Customization: Open source software can be modified and customized to meet specific needs, which can be particularly useful for businesses or organizations with unique requirements.
3. Community support: Open source software often has a large and active community of developers and users who contribute to its development and provide support.
4. Security: Open source software can be more secure

[rank0]:[W704 17:23:14.420996530 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
```
The CLI does not seem to be working on that branch.

Could you paste the command you used, and the traceback, on that PR?

I'll touch base offline. It just didn't recognize the speculators command.
```python
        transformer_config["architectures"] = [arch]

        # Build vLLM config
        vllm_config = {
```
Why don't we need to add the verifier model as part of the config? How are the two differentiated in the vllm_config object?
This pull request has merge conflicts that must be resolved before it can be merged.
```diff
@@ -22,6 +23,23 @@

 logger = init_logger(__name__)

+# Weight name mapping for speculators format compatibility
+SPECULATORS_WEIGHT_MAP = {
+    "fusion_fc.weight": "fc.weight",
```
I thought we removed this translation in speculators?
```python
DEFAULT_NUM_LOOKAHEAD_TOKENS = 5


class SpeculatorsEagleConfig(EAGLEConfig):
```
Why did we change this from SpeculatorsConfig?
Commits

- Add SpeculatorsEagleConfig to handle the speculators config format; update the config loader to detect speculators Eagle models; add weight name remapping in the Eagle model's load_weights; support both standard Eagle and HASS (with layernorms) variants. This enables vLLM to load Eagle models converted using the speculators library's checkpoint converter, mapping config fields and weight names to vLLM's expected format.
- Remove unused Any, Dict, Optional, and AutoConfig imports; keep only Union, which is actually used in type annotations.
- Use PretrainedConfig.get_config_dict() to handle both local and HF paths; simplifies the code and follows best practices; tested with both local paths and HuggingFace model IDs.
- Set method='eagle' in vllm_config to ensure proper model detection; this field is required by the EAGLEConfig parent class and helps with future V1 engine compatibility.
- Change the method field in vllm_config to use speculators_config.get("speculators_model_type", "eagle"), so the method is set dynamically based on the speculators model type while maintaining backward compatibility with the default value of "eagle".
- Add a check for model_type == "eagle" in SpeculativeConfig auto-detection, ensuring speculators Eagle models are properly detected and method is set to "eagle"; fixes the V1 engine compatibility check for speculators Eagle models.
- Import is_speculators_eagle_config and add a simple check for speculators Eagle models when method is not set; a minimal change that handles the speculators format as a special case and fixes an issue where speculative_method was None, causing V0 fallback.
- Add speculators_name_map to handle fusion_fc -> fc weight remapping, plus transformer.* -> model.layers.0.* prefix remapping; fixes a KeyError for fusion_fc.weight when loading speculators Eagle models, similar to the remapping already added to the eagle.py model.
- Update llama_eagle.py to skip transformer weights (loaded separately) and add num_lookahead_tokens to the speculators config (required for Eagle); together these fixes allow speculators Eagle models to work with the V1 engine.
- Add a document explaining all changes needed for speculators Eagle models: the rationale behind each modification, common questions and answers, testing examples, and coverage of config translation, weight remapping, and V1 detection.
- Update speculators config detection to check for the speculators_model_type key; support both eagle and eagle3 in is_speculators_eagle_config; handle Eagle-3-specific config fields (draft_vocab_size, target_hidden_size); infer target_hidden_size from the transformer config if not provided; skip non-existent weights in llama_eagle to handle HASS models gracefully. Eagle-3 models don't need weight translation (they already use correct names). This enables support for nm-testing/eagle3-llama3.1-8b-instruct-speculators and nm-testing/EAGLE3-LLaMA3.3-Instruct-70B-speculators while maintaining backward compatibility with Eagle-1 models.
- Add an RMSNorm import and support for enorm/hnorm in llama_eagle.py; apply layernorms in the forward pass when add_para_norm is enabled; handle speculators weight remapping in EagleLlamaForCausalLM.load_weights. Fixes HASS Eagle models (nm-testing/hass-llama3.1-8b-layernorms) in the V1 engine.
- Remove the redundant model_type field from vllm_config (already defined in EAGLEConfig); extract num_lookahead_tokens from proposal_methods in the speculators config; add proper assertions for the required speculators config structure; remove the unnecessary intermediate variable speculators_cfg.
- Remove the implementation documentation, which is no longer needed now that the implementation is complete.
- Remove V0 engine changes from eagle.py; keep V1 engine support in llama_eagle.py with layernorm support; maintain config detection and translation for the speculators format; ensure V1 engine compatibility for all Eagle models. This simplifies the implementation by focusing only on the modern V1 engine, which provides better performance and features.
- Move SPECULATORS_WEIGHT_MAP to module level to eliminate duplication; replace duplicate _remap_weight_name methods with a single function; fix line-continuation style to use proper parentheses; streamline the weight-loading logic while preserving functionality; remove verbose comments while keeping essential documentation; preserve the original 'fc' naming convention. This consolidation improves maintainability and follows vLLM code-style conventions while preserving all existing functionality for both Eagle-1 and Eagle-3 speculators models.
- Add weight name mapping for speculators format compatibility; support the HASS variant with additional layernorms; handle both Eagle-1 and Eagle-3 configurations; maintain backward compatibility with existing Eagle models. This enables using Eagle draft models packaged with the speculators library directly in vLLM for speculative decoding.
```diff
@@ -159,6 +204,11 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):

         model_weights = {}
+        for name, loaded_weight in weights:
+            remapped_name = remap_speculators_weight_name(name)
```
should this be under a check where we first check if there's a speculators config present in self.config?
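Hypothetically, the guard being suggested might look like the sketch below; the `speculators_config` attribute is an assumed name for illustration, not a field the PR necessarily defines:

```python
# Sketch of the suggested gating: only remap names for checkpoints that
# actually came from speculators. `speculators_config` is hypothetical.
is_speculators = getattr(self.config, "speculators_config", None) is not None

model_weights = {}
for name, loaded_weight in weights:
    if is_speculators:
        name = remap_speculators_weight_name(name)
    model_weights[name] = loaded_weight
```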
Summary
This PR adds support for loading Eagle models converted using the speculators library's checkpoint converter. This enables vLLM to use speculators-format Eagle models for speculative decoding, bridging the gap between the speculators ecosystem and vLLM.
Technical Details

Problem Statement

The speculators library provides a unified framework for speculative decoding models, but uses a different configuration and weight naming convention than vLLM's native Eagle implementation. Key differences include:

Configuration Format:

- speculators: `{"speculators_model_type": "eagle", "transformer_layer_config": {...}, "fusion_bias": bool, "layernorms": bool}`
- vLLM: `{"model_type": "eagle", "model": {...}, "eagle_fc_bias": bool, "model.add_para_norm": bool}`

Weight Naming:

- speculators: `fusion_fc.weight`, `embedding_layernorm.weight`
- vLLM: `fc.weight`, `enorm.weight`
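For illustration, here is a minimal sketch of this translation, assuming only the field mappings listed above; the helper name, nesting of `add_para_norm` under the model sub-config, and the defaults are assumptions rather than the PR's actual code:

```python
def translate_speculators_config(spec_cfg: dict) -> dict:
    """Hypothetical helper: map a speculators Eagle config to vLLM's format.

    Field names follow the comparison above; defaults are assumptions.
    """
    # speculators nests the draft transformer's config here
    model_cfg = dict(spec_cfg["transformer_layer_config"])
    # HASS checkpoints set "layernorms": true, mapped to add_para_norm
    model_cfg["add_para_norm"] = spec_cfg.get("layernorms", False)
    return {
        "model_type": "eagle",
        "model": model_cfg,
        "eagle_fc_bias": spec_cfg.get("fusion_bias", False),
    }
```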
Solution Architecture

This implementation provides a translation layer that consists of:

1. Config Adapter (`SpeculatorsEagleConfig`): detects the `speculators_model_type` field and translates speculators configs into vLLM's Eagle format during loading.
2. Weight Remapping: integrated into `EAGLE.load_weights()` for transparent operation.
3. Automatic Detection: `get_config()` is updated to detect and route speculators configs.

Implementation Details

Files Modified

- `vllm/transformers_utils/configs/speculators_eagle.py` (new): the config adapter.
- `vllm/model_executor/models/eagle.py`: weight name remapping in load_weights.
- `vllm/transformers_utils/config.py`: routes to `SpeculatorsEagleConfig` when detected.

Testing
Models Tested

1. Standard Eagle (without layernorms): `nm-testing/eagle-llama3.1-8b-instruct`, config `{"layernorms": false, "fusion_bias": false}`
2. HASS variant (with layernorms): `nm-testing/hass-llama3.1-8b-layernorms`, config `{"layernorms": true, "fusion_bias": false}`, with additional `embedding_layernorm` and `pre_lm_head_layernorm` weights

Test Results

Both models load successfully and generate text with speculative decoding.

Verification Script

Verification Output
Future Work

Phase 1: Direct Model Serving (Future PR)

Enable `vllm serve <speculators_model>` by automatically configuring speculative decoding. Implementation approach:

- a `SpeculatorsModelConfig` that reads verifier info from the speculators config
- update `EngineArgs` to auto-configure when detecting speculators models

Phase 2: Extended Speculators Support

Phase 3: Ecosystem Integration
Design Decisions

- Weight remapping lives in `load_weights()` for transparency and minimal code changes.
- Detection uses the `speculators_model_type` field as the definitive indicator.
- HASS support keys off the `layernorms` field, mapped to vLLM's `add_para_norm`.

Performance Impact
Dependencies
No new dependencies required. Uses existing vLLM and transformers infrastructure.
Checklist
References
This is a draft PR for initial feedback.