[Meta] Official Eagle mm support, first enablement on llama4 #20788
base: main
Conversation
Signed-off-by: qizixi <qizixi@meta.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @morgendave, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces comprehensive multimodal inference support for the EAGLE speculative decoding method, with an initial focus on Llama4 models. It enables the EAGLE draft model to process both text and image inputs, significantly enhancing its capability for multimodal large language models. The changes span from core model implementations to the speculative decoding pipeline and include new examples and robust test coverage for multimodal scenarios.
Highlights
- Multimodal EAGLE Support: Enables multimodal (MM) inference capabilities for the EAGLE speculative decoding method, specifically targeting Llama4 models, with a design that allows for future extension to other multimodal models.
- Llama4 EAGLE Draft Model: Introduces `EagleLlama4ForCausalLM` and `LlamaModel` to provide a dedicated draft model implementation for Llama4, capable of processing combined text and image embeddings for speculative decoding (a minimal sketch of this interface follows the list).
- Speculative Decoding Pipeline Integration: Integrates multimodal input handling into the core EAGLE speculative decoding pipeline, allowing the draft model to receive and process image embeddings alongside text tokens for more accurate proposals.
- Multimodal Testing & Examples: Adds a new example script (`spec_decode.py`) for running offline multimodal inference with custom image prompts and extends the end-to-end test suite to include multimodal EAGLE correctness tests for Llama4.
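To make the summary concrete, here is a minimal, self-contained sketch of the draft-model interface described above (toy code, not the actual `EagleLlama4ForCausalLM`; the class name, sizes, and layer choices are illustrative assumptions): when precomputed `inputs_embeds` containing merged text and image embeddings are provided, the draft model uses them instead of embedding `input_ids` itself, then combines them with the target model's hidden states to propose draft tokens.

```python
from typing import Optional

import torch
import torch.nn as nn


class ToyEagleDraftModel(nn.Module):
    """Toy EAGLE-style draft model that accepts precomputed multimodal
    embeddings (text + image) via inputs_embeds."""

    def __init__(self, vocab_size: int = 128, hidden_size: int = 16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Projects [input embedding ; target hidden state] back to hidden_size.
        self.fc = nn.Linear(2 * hidden_size, hidden_size)
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=2, batch_first=True)

    def forward(
        self,
        input_ids: torch.Tensor,
        hidden_states: torch.Tensor,
        inputs_embeds: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        if inputs_embeds is None:
            # Text-only path: embed the draft token ids as usual.
            inputs_embeds = self.embed_tokens(input_ids)
        # Multimodal path: inputs_embeds already has the image embeddings
        # merged at the placeholder positions, so it is consumed directly.
        x = self.fc(torch.cat([inputs_embeds, hidden_states], dim=-1))
        return self.layer(x)


# Usage: 4 positions; the "image" embeddings stand in for what the target
# model's multimodal embedding layer would produce.
draft = ToyEagleDraftModel()
ids = torch.randint(0, 128, (1, 4))
target_hidden = torch.randn(1, 4, 16)
mm_embeds = torch.randn(1, 4, 16)
print(draft(ids, target_hidden, inputs_embeds=mm_embeds).shape)  # (1, 4, 16)
```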
Code Review
This pull request introduces multi-modal (MM) support for EAGLE speculative decoding, with an initial implementation for the llama4 model architecture. The changes are comprehensive, touching upon example scripts, E2E tests, and core model execution logic.
Overall, the implementation for `llama4_eagle` looks solid. However, I've identified a few issues:
- Critical: The E2E tests contain hardcoded local paths, which will break CI and prevent other developers from running the tests.
- High: The `llama_eagle` and `llama_eagle3` models have been updated to accept an `inputs_embeds` parameter, but they don't actually use it. This is misleading and should be fixed to either fully support MM or explicitly disallow it.
Once these issues are addressed, the PR should be in good shape.
tests/v1/e2e/test_spec_decode.py
Outdated
(("eagle", "/home/zhiweiz/local/models/scout_base_HF_20250605_201140", | ||
"morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", 4), False), | ||
(("eagle", "/home/zhiweiz/local/models/scout_base_HF_20250605_201140", | ||
"morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", 4), True), |
The test `test_eagle_correctness` includes hardcoded local paths to a model (`/home/zhiweiz/local/models/scout_base_HF_20250605_201140`). This makes the test non-portable and will cause it to fail in CI environments and on other developers' machines. Please replace this with a model from the Hugging Face Hub or use a mechanism to download test-specific models.
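A minimal sketch of one way to make the parametrization portable (the Hub model id and the environment-variable override are illustrative assumptions, not the fix adopted in this PR):

```python
import os

# Default to a Hugging Face Hub id so the test runs in CI, while still
# allowing a local checkout to be substituted via an environment variable.
LLAMA4_BASE_MODEL = os.environ.get(
    "LLAMA4_BASE_MODEL", "meta-llama/Llama-4-Scout-17B-16E-Instruct")

mm_test_cases = [
    (("eagle", LLAMA4_BASE_MODEL,
      "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", 4), False),
    (("eagle", LLAMA4_BASE_MODEL,
      "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct", 4), True),
]
```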
inputs_embeds: Optional[torch.Tensor] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    return self.model(input_ids, positions, hidden_states)
The `inputs_embeds` parameter is added to the `forward` method's signature but is not used within the method body. The call to `self.model()` doesn't pass this parameter along, which means multimodal inputs will be ignored.
This is inconsistent with the implementation for `llama4_eagle` and suggests that multimodal support is incomplete for this model. If multimodal input is not supported for this model, it would be better to raise a `NotImplementedError` when `inputs_embeds` is provided. If it is intended to be supported, `inputs_embeds` should be passed to `self.model` and handled there.
inputs_embeds: Optional[torch.Tensor] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    if inputs_embeds is not None:
        raise NotImplementedError(
            f"{type(self).__name__} does not support multimodal inputs yet.")
    return self.model(input_ids, positions, hidden_states)
inputs_embeds: Optional[torch.Tensor] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    return self.model(input_ids, positions, hidden_states)
Similar to `llama_eagle.py`, the `inputs_embeds` parameter is added to the `forward` method's signature but is not used. This makes the multimodal support for this model incomplete and potentially buggy if a user tries to use it with multimodal inputs.
Please either fully implement the handling of `inputs_embeds` or raise a `NotImplementedError` if it's not `None` to prevent silent failures.
inputs_embeds: Optional[torch.Tensor] = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    if inputs_embeds is not None:
        raise NotImplementedError(
            f"{type(self).__name__} does not support multimodal inputs yet.")
    return self.model(input_ids, positions, hidden_states)
Signed-off-by: morgendave <morgendave@gmail.com>
Force-pushed from 8df05d0 to d64bf91 (Compare)
Is #20591 supposed to be merged first?
Yes, this would be rebased after that.
Purpose
Enable MM inference for EAGLE, targeting mllama4 in this PR, but the approach is generally easy to extend to other models.
Issue with this PR:
MM chunked prefill needs to be disabled, or `max_num_batched_tokens` (mbnt) must be set to a large number. A follow-up PR will fix this using an unshifted EAGLE prefill.
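As a rough illustration of the workaround (a sketch only; the engine arguments and values below are assumptions for offline use, not taken from this PR), chunked prefill can be disabled, or the batched-token budget raised, when running EAGLE with multimodal inputs:

```python
from vllm import LLM

# Sketch (not from this PR): run Llama4 with the EAGLE draft model while
# working around the MM chunked-prefill limitation. The model ids,
# speculative_config keys, and values here are illustrative.
llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct",
        "num_speculative_tokens": 4,
    },
    # Either disable chunked prefill ...
    enable_chunked_prefill=False,
    # ... or keep it enabled and set max_num_batched_tokens to a large value:
    # max_num_batched_tokens=32768,
)
```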
Test Plan
vllm serve with benchmark testing
cmd
Baseline: without the `--speculative-config` flag
benchmark cmd
Test Result
Eagle MM benchmark
Follow ups
Need to make offline inference work with vision datasets