
[Meta] Llama4 EAGLE Support #20591


Status: Open
morgendave wants to merge 6 commits into main

Conversation

@morgendave (Collaborator) commented on Jul 7, 2025

Purpose

Support EAGLE speculative decoding for Llama4 with a dense-only draft model, based on Meta's official implementation.
Original Author: @zixi-qi

Test Plan

Ran end-to-end with an uploaded Scout-based EAGLE draft model.
Example command:

CUDA_VISIBLE_DEVICES=4,5,6,7 VLLM_USE_V1=1 python examples/offline_inference/spec_decode.py  --num_spec_tokens 7 --num_prompts 1 --method eagle --model_dir /home/zhiweiz/local/models/scout_base_HF_20250605_201140 --eagle_dir /home/zhiweiz/local/models/scout_draft_HF_20250605_202942 --tp 4

Unit test: python -m pytest tests/v1/e2e/test_spec_decode.py

vllm serve + benchmarking
EAGLE server command:

#!/bin/bash
# Configuration of environment variables
export VLLM_USE_MODELSCOPE=False
export VLLM_TORCH_PROFILER_DIR=~/vllm_profile
export CUDA_VISIBLE_DEVICES=4,5,6,7
export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
# Command to run the vllm server
spec_dec_config='{"method": "eagle", "model": "/home/zhiweiz/local/models/scout_draft_HF_20250605_202942", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 4, "max_model_len": 32768}'
vllm serve /home/zhiweiz/local/models/scout_base_HF_20250605_201140 --disable-log-requests \
    -tp 4 \
    --max-num-seqs 128 \
    --max_num_batched_tokens=80000 \
    --max-model-len=32768 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --speculative-config="$spec_dec_config" \
    --num-lookahead-slots=3 \
    2>&1 | tee /data/users/$USER/logs/server/vllm_17b16e_vllm_serving$(date +%Y%m%d_%H%M%S).log

The baseline server command is the EAGLE server command with the --speculative-config="$spec_dec_config" line removed.

Benchmarking command:

python benchmarks/benchmark_serving.py --backend vllm --model /home/zhiweiz/local/models/scout_base_HF_20250605_201140 --dataset-name hf --dataset-path philschmid/mt-bench --seed 0 --max-concurrency 16 2>&1 | tee /data/users/$USER/tmp/vllm_17b16e_vllm_loadgen$(date +%Y%m%d_%H%M%S).log

Test Result

total_num_output_tokens: 256
num_drafts: 96
num_draft_tokens: 672
num_accepted_tokens: 159
mean acceptance length: 2.66
--------------------------------------------------
acceptance at token 0: 0.88
acceptance at token 1: 0.78
acceptance at token 2: 0.00
acceptance at token 3: 0.00
acceptance at token 4: 0.00
acceptance at token 5: 0.00
acceptance at token 6: 0.00

Unit test passed.

EAGLE Benchmark

Maximum request concurrency: 16
100%|██████████| 1000/1000 [03:17<00:00,  5.07it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  197.21
Total input tokens:                      77841
Total generated tokens:                  219813
Request throughput (req/s):              5.07
Output token throughput (tok/s):         1114.62
Total Token throughput (tok/s):          1509.34
---------------Time to First Token----------------
Mean TTFT (ms):                          88.57
Median TTFT (ms):                        82.45
P99 TTFT (ms):                           404.05
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.85
Median TPOT (ms):                        13.71
P99 TPOT (ms):                           18.93
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.44
Median ITL (ms):                         37.44
P99 ITL (ms):                            53.12
==================================================

Accepted tokens average: 2.75-2.95

Baseline

Maximum request concurrency: 16
100%|██████████| 1000/1000 [05:10<00:00,  3.22it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  310.13
Total input tokens:                      77841
Total generated tokens:                  220354
Request throughput (req/s):              3.22
Output token throughput (tok/s):         710.51
Total Token throughput (tok/s):          961.50
---------------Time to First Token----------------
Mean TTFT (ms):                          61.32
Median TTFT (ms):                        56.01
P99 TTFT (ms):                           146.22
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.19
Median TPOT (ms):                        22.20
P99 TPOT (ms):                           23.44
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.18
Median ITL (ms):                         21.29
P99 ITL (ms):                            40.81
==================================================

Follow up

Upload draft models to Hugging Face:
Scout-based draft: morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct
Pending: Maverick draft: morgendave/EAGLE-Llama-4-Maverick-17B-128E-Instruct
A minimal offline usage sketch with the Scout draft follows below.
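For reference, a minimal offline-inference sketch using the Scout draft listed above. This is not the PR's test script: it assumes the public draft repo name above is available, that this vLLM build accepts a speculative_config dict on the LLM constructor (as the example script does), and the prompt, TP, and memory settings are illustrative only.

from vllm import LLM, SamplingParams

# Illustrative sketch: model/draft repo names and resource settings are assumptions.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=4,
    max_model_len=16384,          # capped to keep the BF16 KV cache from OOMing
    gpu_memory_utilization=0.7,
    speculative_config={
        "method": "eagle",
        "model": "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct",
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)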

github-actions bot commented on Jul 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @morgendave, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances vLLM's capabilities by integrating speculative decoding support for Llama4 models, specifically designed to work with a dense-only draft model. It introduces a new model implementation tailored for this purpose and updates an example script to demonstrate its usage, thereby expanding the range of supported models for efficient inference.

Highlights

  • Llama4 EAGLE Support: This pull request introduces core support for Llama4 models within the EAGLE speculative decoding framework. It enables the use of a dense-only draft model specifically designed for Llama4, leveraging the official Meta-based support.
  • New Model Implementation: A new file, vllm/model_executor/models/llama4_eagle.py, has been added. It defines LlamaModel, a specialized draft model for Llama4, and EagleLlama4ForCausalLM, which integrates this draft model for speculative decoding. It includes custom weight-loading logic for stacked parameters and handles configuration validation for features like YoCo and MoE.
  • Model Registration: The newly implemented EagleLlama4ForCausalLM has been registered in vllm/model_executor/models/registry.py at line 242, making it discoverable and usable by vLLM (a sketch of this style of registry mapping follows after this list).
  • Example Script Updates: The examples/offline_inference/spec_decode.py script has been updated to facilitate testing of the new Llama4 EAGLE support. Specifically, gpu_memory_utilization was adjusted from 0.8 to 0.7 (line 84), and max_model_len=16384 was added to the engine initialization (line 87).
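For orientation, the registry mapping mentioned above generally has the following shape. This is a sketch, not the literal registry source; the surrounding dict name and the placeholder for neighboring entries are assumptions.

# Architecture name -> (module under vllm/model_executor/models, class name).
_SPECULATIVE_DECODING_MODELS = {
    # ... existing EAGLE/Medusa entries ...
    "EagleLlama4ForCausalLM": ("llama4_eagle", "EagleLlama4ForCausalLM"),
}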

@mergify mergify bot added the documentation, llama, and speculative-decoding labels Jul 7, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request adds support for EAGLE speculative decoding for Llama4 models. The changes look good overall, with a new model file vllm/model_executor/models/llama4_eagle.py implementing the draft model logic, a corresponding entry in the model registry, and updates to an example script.

My review focuses on the new llama4_eagle.py file. I've identified a few areas for improvement regarding code style, maintainability, and memory efficiency. Specifically, I've suggested removing unnecessary line continuations, addressing a potentially fragile inheritance pattern, adding a missing type hint, and refactoring a weight loading function to be more memory-efficient. These changes should improve the quality and robustness of the new implementation.

Comment on lines 42 to 43:

        self.config = vllm_config. \
            speculative_config.draft_model_config.hf_config
Severity: medium

This line continuation with a backslash (\) is unnecessary and not compliant with PEP 8. The line is well within the recommended character limit and can be written on a single line for better readability.

        self.config = vllm_config.speculative_config.draft_model_config.hf_config

class EagleLlama4ForCausalLM(Llama4ForCausalLM):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        nn.Module.__init__(self)
Severity: medium

The __init__ method of EagleLlama4ForCausalLM calls nn.Module.__init__(self) directly, bypassing the initializer of its base class, Llama4ForCausalLM. This is a fragile design because if Llama4ForCausalLM's __init__ sets up important state that is used by inherited methods (like permute_qk_weight_for_rotary), this implementation could break.

To improve maintainability, consider either using composition over inheritance or ensuring a proper call to super().__init__(...).
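As an illustration of the composition alternative, here is a rough sketch. The wrapper class name is hypothetical; it assumes the LlamaModel draft backbone defined in this PR's llama4_eagle.py and VllmConfig from vllm.config, and it is not the PR's actual implementation.

from typing import Optional

import torch
import torch.nn as nn

from vllm.config import VllmConfig
from vllm.model_executor.models.llama4_eagle import LlamaModel


class EagleLlama4DraftWrapper(nn.Module):
    """Wraps the draft backbone instead of inheriting from Llama4ForCausalLM."""

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None:
        super().__init__()  # plain nn.Module init; no partially-initialized base class
        self.config = vllm_config.speculative_config.draft_model_config.hf_config
        self.model = LlamaModel(vllm_config=vllm_config, prefix=prefix)

    def forward(
        self,
        input_ids: Optional[torch.Tensor],
        positions: torch.Tensor,
        hidden_states: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Delegate to the composed draft backbone.
        return self.model(input_ids, positions, hidden_states)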

Comment on lines 156 to 157:

        self.config = vllm_config. \
            speculative_config.draft_model_config.hf_config
Severity: medium

Similar to a previous comment, this line continuation with a backslash (\) is unnecessary and not compliant with PEP 8. Please write this on a single line.

        self.config = vllm_config.speculative_config.draft_model_config.hf_config

    ) -> tuple[torch.Tensor, torch.Tensor]:
        return self.model(input_ids, positions, hidden_states)

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
Severity: medium

The load_weights method is missing a return type hint. According to PEP 484, methods that do not return a value should be annotated with -> None to improve code clarity and allow for better static analysis.

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> None:

Comment on lines +187 to +214:

        model_weights = {}
        weights = [
            self.permute_qk_weight_for_rotary(name, loaded_weight)
            for name, loaded_weight in weights
        ]
        for name, loaded_weight in weights:
            if "lm_head" not in name:
                name = "model." + name
            model_weights[name] = loaded_weight

        loader.load_weights(model_weights.items())
Severity: medium

This block of code for processing weights is not memory-efficient. It first creates a list of all permuted weights, and then a dictionary of these weights, both of which can consume a large amount of memory for large models.

A more memory-efficient approach is to use a generator to process the weights one by one. This avoids loading all weights into memory at once.

        def _processed_weights():
            for name, loaded_weight in weights:
                name, loaded_weight = self.permute_qk_weight_for_rotary(
                    name, loaded_weight)
                if "lm_head" not in name:
                    name = "model." + name
                yield name, loaded_weight

        loader.load_weights(_processed_weights())

@aarnphm (Collaborator) commented on Jul 8, 2025

#18369

@morgendave (Collaborator, Author) replied:
> #18369

Yes, we are aware of this; that's why, from the Meta side, we want to upstream this first to avoid conflicts as we work on official support. Also, our official design for llama4_eagle is dense-based to get the best performance.

@mergify mergify bot added the v1 label Jul 8, 2025
@@ -0,0 +1,199 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
Reviewer (Collaborator): Can you also add the copyright from the Meta side?

@morgendave (Collaborator, Author): Absolutely, thanks for the suggestion.

Reviewer (Collaborator) commented on the diff:
        speculative_config=speculative_config,
        disable_log_stats=False,
        max_model_len=16384,
Is this needed due to an OOM issue?

@morgendave (Collaborator, Author) replied on Jul 9, 2025:
I think this might not be needed; it is just from our internal test, though I think the default is smaller than this.
Yes, this is for OOM: the original context length with BF16 would be too big.
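To make the OOM point concrete, a back-of-the-envelope KV-cache estimate follows. The layer/head/dim values below are assumed for illustration only, not taken from the Scout config, and the sequence lengths just show how quickly a very long native context outgrows GPU memory.

# Rough per-token KV-cache size: K and V, per layer, per KV head, per head dim, in BF16.
num_layers, num_kv_heads, head_dim, bytes_bf16 = 48, 8, 128, 2   # assumed values
per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_bf16  # ~192 KiB per token

for seq_len in (16_384, 32_768, 10_000_000):  # 10M only to show the scale of a huge native context
    gib = per_token * seq_len / 2**30
    print(f"seq_len={seq_len:>10,}  KV cache ~ {gib:,.1f} GiB per sequence (before TP sharding)")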

@houseroad (Collaborator) left a comment:

I saw the test plan; we still need force_eager. Does that mean CUDA graph / torch.compile still doesn't work yet?

@morgendave (Collaborator, Author) replied:
> I saw the test plan; we still need force_eager. Does that mean CUDA graph / torch.compile still doesn't work yet?

Nope, just copied from @zixi-qi's runbook. I can delete that and it should work.

@RonaldBXu (Contributor) commented:

Hi @morgendave, I'm the author of #18369. I noticed that our PRs serve the same purpose and have similar code, like the padding of no_rope_layers. The EAGLE head that I uploaded is also dense-only (for the best performance). I noticed that your EAGLE model has 3 decoder layers; is this usually the case, or does it just give better performance? I don't think that is compatible with my code. I also tried running your code straight from your PR (target model: Scout, draft model: Scout) and changing max_model_len to a smaller number to avoid OOM; it runs, but it doesn't seem to give me acceptance > 1. Maybe there is some code still missing? I'll try to run it again.

I would also like to call out that my PR is almost 2 months old at this point, and I feel it is fair to merge my PR first; maybe you can build on top of it (I'm pretty sure it works for standard EAGLE models)? I recognize that you have added quantization support and QK permutation for the rotary embedding, which will be great additions. Thanks.

@mergify mergify bot added the new-model label Jul 10, 2025
@morgendave (Collaborator, Author) replied:

> (quoting @RonaldBXu's comment above)

Sorry, but this is Meta's official support; it will also be followed up with multimodal (MM) and other support for the next-generation model, so we have to merge this.

@yeqcharlotte (Collaborator) left a comment:

You can add the following to the commit message so the author is set correctly:

Co-authored-by: Zixi Qi <qizixi@meta.com>

Could you include the vllm serve command you used and add some TTFT/TTIT numbers to the original PR? What's the E2E speed-up we see?

"yuhuili/EAGLE-LLaMA3.1-Instruct-8B", 1),
("eagle3", "meta-llama/Llama-3.1-8B-Instruct",
"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", 1),
("eagle", "/home/zhiweiz/local/models/scout_base_HF_20250605_201140",
Reviewer (Collaborator):

Update this to the public model repo.

@@ -81,9 +81,10 @@ def main():
         tensor_parallel_size=args.tp,
         enable_chunked_prefill=args.enable_chunked_prefill,
         enforce_eager=args.enforce_eager,
-        gpu_memory_utilization=0.8,
+        gpu_memory_utilization=0.7,
Reviewer (Collaborator):

Is this the issue fixed in #20628 when the max batch size is configured?

"yuhuili/EAGLE-LLaMA3.1-Instruct-8B", 1),
("eagle3", "meta-llama/Llama-3.1-8B-Instruct",
"yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", 1),
("eagle", "meta-llama/Llama-4-Scout-17B-16E-Instruct",
Reviewer (Collaborator):

Check whether this will be triggered in CI; it might OOM in CI jobs: https://github.com/vllm-project/vllm/blob/main/.buildkite/test-pipeline.yaml#L265 cc @WoosukKwon

Reply (Collaborator):

Yes, the CI can't run this. I think we can keep the code but add a pytest.skip for Llama4 so that we can easily test it locally.
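A rough sketch of how that could look in the e2e test parametrization. The parameter names, tuple layout, and skip reason here are assumptions for illustration, not the actual contents of tests/v1/e2e/test_spec_decode.py.

import pytest

# Hypothetical parametrization: keep the Llama4 case in the list but skip it in CI,
# so it can still be selected and run manually on a 4-GPU machine.
@pytest.mark.parametrize(
    "method, model_name, spec_model_name, tp_size",
    [
        ("eagle", "meta-llama/Llama-3.1-8B-Instruct",
         "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", 1),
        pytest.param(
            "eagle", "meta-llama/Llama-4-Scout-17B-16E-Instruct",
            "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", 4,
            marks=pytest.mark.skip(reason="needs 4 GPUs; OOMs on CI runners"),
        ),
    ],
)
def test_eagle_correctness(method, model_name, spec_model_name, tp_size):
    ...  # body omitted; see tests/v1/e2e/test_spec_decode.py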

@luccafong luccafong requested a review from houseroad July 10, 2025 22:42
@houseroad houseroad added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 11, 2025
@houseroad houseroad enabled auto-merge (squash) July 11, 2025 05:43
@houseroad (Collaborator) left a comment:

Looks good.

@DarkLight1337 DarkLight1337 added this to the v0.10.0 milestone Jul 14, 2025
@DarkLight1337 (Member) commented:

Please fix the failing tests.

auto-merge was automatically disabled July 14, 2025 20:36

Head branch was pushed to by a user without write access

@WoosukKwon (Collaborator) left a comment:

LGTM!

Comment on lines +81 to +86:

    def forward(
        self,
        input_ids: Optional[torch.Tensor],
        positions: torch.Tensor,
        hidden_states: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
Reviewer (Collaborator):
Will MM input support be added later?

@morgendave (Collaborator, Author) replied:
Yep that's in #20591

Labels: documentation (Improvements or additions to documentation), llama (Related to Llama models), new-model (Requests to new models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding, v1
9 participants