[V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine #20660
Conversation
Summary of Changes
Hello @nopperl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the flexibility of the v1 engine by enabling support for a broader range of Mamba2 layer implementations beyond just `MambaMixer2`. It achieves this through the introduction of a new abstract base class for Mamba2 layers and a more centralized, configuration-driven approach to managing Mamba-specific parameters like `chunk_size`.
Highlights
- Generalize Mamba2 Layer Support: Introduced an abstract base class `Mamba2Layer` from which all Mamba2 layers supporting the v1 engine can inherit. This allows the v1 engine to dynamically load various Mamba2 implementations, moving beyond the previous hardcoded dependency on `MambaMixer2`.
- Centralized Mamba Chunk Size Retrieval: Refactored the `Mamba2AttentionBackend` to retrieve the `chunk_size` directly from the model's configuration (`vllm/config.py`) rather than from an attribute of the Mamba layer itself. This reduces indirection and simplifies configuration (see the sketch after this list).
- MambaMixer2 Adaptation: `MambaMixer2` now inherits from the new `Mamba2Layer` abstract base class and no longer requires `chunk_size` to be passed during initialization, as this value is now obtained globally from the model configuration.
- Model-Specific Updates: Updated several model implementations (Bamba, FalconH1, GraniteMoEHybrid, Mamba2, NemotronH, Zamba2) to remove the direct passing of `chunk_size` when instantiating their respective `MambaMixer2` layers, aligning with the centralized configuration approach.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces an abstract base class `Mamba2Layer` to enable support for various Mamba2 layers in the v1 engine, enhancing extensibility. It also refactors the `Mamba2AttentionBackend` to retrieve the `chunk_size` from the model config, centralizing configuration. The changes are well-structured and improve the flexibility of the v1 engine.
self.chunk_size = runner.vllm_config.model_config.get_mamba_chunk_size()
assert self.chunk_size is not None, (
    "chunk_size needs to be set in the model config for Mamba2 models")
Consider logging a warning message instead of raising an assertion error. This will allow the program to continue running, while still informing the user that there might be an issue with their configuration.
if self.chunk_size is None:
    logger.warning("chunk_size needs to be set in the model config for Mamba2 models")
Force-pushed from d754b7a to 420066e (Compare)
…mbaMixer2 in v1 engine Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Force-pushed from 420066e to e3af68a (Compare)
Why does the Plamo2 model not use the MambaMixer2 layer? There is a TODO in the modeling code that it should be rebased: could that be a better solution than introducing more abstractions?
(The TODO comment was referring to the Mamba2 kernels, not MambaMixer2, and is removed in #19674.) In theory it is possible to refactor MambaMixer2 to support PLaMo2. However, there are significant differences between Plamo2MambaMixer and MambaMixer2 (as explained here: #14323 (comment)). We're unsure if it is prudent to further complicate MambaMixer2 just in order to support a single model. From an upstream perspective, this change might be less intrusive. (Btw, thanks for adding hybrid Mamba support in the v1 engine @tdoublep!)
LGTM. I'm only concerned about the class name
@tdoublep I think the abstraction is necessary for adding other state space models like minimax.
@nopperl https://github.com/pfnet/vllm/tree/plamo2-follow-up-v1 For these changes, since we need to modify MambaMixer2 significantly to support piecewise CUDA graph, I think you can make the PR after we finish updating MambaMixer2.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Thank you very much. Though `LinearAttentionBase` is more general, we use "mamba" here and there, and all of those usages would have to be changed if you decide to call it "LinearAttention".
I understand, makes sense!
…lm-project#20660) Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
…lm-project#20660) Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com> Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
#19327 and #20016 enabled v1 support for Mamba2 and hybrid models. However, this currently only works for models using `MambaMixer2`, since it is hardcoded in two code paths. This prevents models with a custom Mamba2 layer (like PLaMo2) from using the v1 engine. This PR proposes to fix this problem with the following changes:
- An abstract base class `Mamba2Layer` is introduced from which all Mamba2 layers that support the v1 engine can inherit. This allows the v1 engine to dynamically load these layers instead of hardcoding `MambaMixer2` (see the sketch after this list).
- `Mamba2AttentionBackend` now retrieves the `chunk_size` from the model config instead of accessing the Mamba layer's attribute.

The core issue of this PR is also resolved without the second change, but I think it makes sense to include it since it reduces the indirection.
Related RFC: #17140
@heheda12345
Test Plan
`tests/models/language/generation/test_hybrid.py`
Test Result
Tests passed locally.
(Optional) Documentation Update
Run PLaMo2 with v1
To execute PLaMo2 using the v1 engine, first apply all of the following patches:
Then, run the following example script:
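The original example script is not reproduced here; below is a minimal sketch of what it might look like. The checkpoint name `pfnet/plamo-2-1b`, the `VLLM_USE_V1=1` switch, and `enforce_eager=True` are assumptions and may need to be adjusted for your setup:

```python
import os

# Assumed switch to select the v1 engine; set it before importing vLLM.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Checkpoint name is an assumption for illustration; trust_remote_code is
# needed because PLaMo2 ships custom modeling/config code.
llm = LLM(
    model="pfnet/plamo-2-1b",
    trust_remote_code=True,
    enforce_eager=True,  # assumed: avoid CUDA-graph paths not yet supported
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["The capital of Japan is"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```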