[V1] [Doc] Update V1 docs for Mamba models #20499
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
Summary of Changes
Hello @tdoublep, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an automated mechanism for determining the optimal attention block size for hybrid (attention/SSM) models within the V1 framework. This enhancement simplifies the user experience by removing the need for manual block size configuration and enables these models to be served via the `vllm serve` CLI.
Highlights
- Automated Attention Block Size: Implemented an automated mechanism to determine the optimal attention block size for hybrid (attention/SSM) models, removing the need for manual configuration and simplifying the user experience.
- Hybrid Model V1 Support: Enabled full support for hybrid attention/SSM models (e.g., Bamba, Falcon-H1, GraniteMoEHybrid, Nemotron-H, Zamba2) within the V1 framework, including compatibility with the `vllm serve` CLI.
- Dynamic KV Cache Alignment: Refactored the KV cache specification logic to dynamically align the page sizes for attention and Mamba layers in hybrid models. This involves automatically adjusting the attention block size and padding the Mamba page size to ensure exact alignment (see the sketch after this list).
- Documentation Updates: Updated the `supported_models.md` and `v1_guide.md` documentation to reflect the new V1 support status for Mamba and hybrid models, clarifying their functional status and specific requirements (e.g., eager mode, disabling prefix caching).
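To make the alignment in the third highlight concrete, here is a minimal sketch of the kind of computation involved. It is illustrative only: the function name, the 16-token block granularity, and the example byte counts are assumptions, not the PR's actual code.

```python
import math


def align_page_sizes(attn_bytes_per_token: int,
                     mamba_page_bytes: int) -> tuple[int, int]:
    """Pick an attention block size (multiple of 16 tokens) whose KV-cache
    page is at least as large as the Mamba state page, then pad the Mamba
    page up so that both page sizes match exactly."""
    attn_block_size = 16 * math.ceil(
        mamba_page_bytes / (16 * attn_bytes_per_token))
    mamba_page_bytes_padded = attn_block_size * attn_bytes_per_token
    return attn_block_size, mamba_page_bytes_padded


# Example: if one attention token takes 256 bytes of KV cache and the Mamba
# state page is 41,000 bytes, the block size becomes 176 tokens and the
# padded Mamba page 45,056 bytes.
print(align_page_sizes(256, 41_000))  # (176, 45056)
```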
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Code Review
This pull request automates the choice of attention block size for hybrid models in V1, simplifying the user experience and enabling the models to run via the `vllm serve` CLI. The changes include updates to documentation and modifications to the V1 worker's GPU model runner to dynamically adjust attention block sizes. The code appears well-structured and addresses the objective effectively.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
`cache_config.block_size` is accessed here and there in vLLM, so we need to find a way to update this value safely. #19407 faces a similar issue with updating `max_model_len`. Maybe you can talk with her.
I don't have a good idea yet, but I'm happy to discuss it.
While debugging #20016 I looked into it quite a bit and was pretty confident that we don't read the block size out of the `cache_config` after we create the `KVCacheSpec`. I will check again though and think about it further.
This pull request has merge conflicts that must be resolved before it can be merged.
@tdoublep For example: vllm/vllm/v1/spec_decode/eagle.py, line 43 in 3eb4ad5.
What about adding a user-defined function for each hybrid model to help detect the attention page size and mamba page size? For example, the input is `VllmConfig` and the output is the set of `KVCacheSpec` used in the model.
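As a rough illustration of that idea, the per-model hook could look something like the protocol below. This is a sketch under assumptions: the protocol and method names are hypothetical and not an existing vLLM interface, while `VllmConfig` and `KVCacheSpec` are the real classes mentioned in the comment.

```python
from typing import Protocol

from vllm.config import VllmConfig
from vllm.v1.kv_cache_interface import KVCacheSpec


class HasKVCacheSpecs(Protocol):
    """Hypothetical per-model hook: the model describes its own KV cache
    layout instead of the runner inferring it layer by layer."""

    @classmethod
    def get_kv_cache_specs(cls, vllm_config: VllmConfig) -> dict[str, KVCacheSpec]:
        """Return a KVCacheSpec for each cache-bearing layer of the model."""
        ...
```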
Fully agree.
Not sure what you mean by user-defined here - the developer who contributes and maintains the modeling code, or the actual user who deploys a hybrid model with vLLM? It seems like the "correct" place to update the vLLM config would be something like:

```python
if is_hybrid(model_config):
    attn_block_size, mamba_page_size_padded = parse_vllm_config_for_hybrid(vllm_config)
    cache_config.block_size = attn_block_size
    cache_config.mamba_page_size = mamba_page_size_padded
```

And then the changes within the GPU model runner would be minimal, and we would ensure that any code that reads the attention block size from the cache config is still correct. WDYT?
Sorry for the confusion. I mean "the developer who contributes and maintains the modeling code", but I will be happy if you can implement a general solution. I think it is not GPU-specific, as other backends can use it in the future. I prefer to update the config here: Line 4778 in 71d1d75.
OK! Sounds good, let me give it another shot and get back to you.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@DarkLight1337 @heheda12345 Should we merge this PR to get the doc updates in while I continue working on the config stuff? I don't really think it makes sense to tie them together. If you agree, I would revert the non-doc changes and also add a note that hybrid models don't work via the CLI yet.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@heheda12345 I took another go at the config update. Now everything happens from within the config. Note: I can clean up the if/elif/else stuff when parsing the mamba config easily, but I want to get your thoughts on this approach.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
vllm/model_executor/models/config.py (Outdated)

```python
"BambaForCausalLM": HybridAttentionMambaModelConfig,
"GraniteMoeHybridForCausalLM": HybridAttentionMambaModelConfig,
"NemotronHForCausalLM": NemotronHModelConfig,
"Zamba2ForCausalLM": Zamba2ModelConfig,
```
I don't want to require each new model to update this page. What about letting `get_mamba_cache_shape` be an interface that should be implemented by each hybrid model? The abstractions you made now can become useful utility functions to minimize code duplication when implementing this interface for each model.

```python
class IsHybrid(Protocol):
    def get_mamba_cache_shape(cls, ...)
```
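As an illustration of how a model might satisfy such a protocol, here is a toy sketch; the config attribute names and the returned state shapes are assumptions for illustration only, not any real model's fields.

```python
from types import SimpleNamespace


class ExampleHybridForCausalLM:
    """Toy model class showing the shape of the proposed interface."""

    @classmethod
    def get_mamba_cache_shape(cls, config):
        # Per-layer conv state: (kernel_size - 1, inner_dim)
        conv_state_shape = (config.mamba_d_conv - 1,
                            config.mamba_expand * config.hidden_size)
        # Per-layer SSM state: (num_heads, head_dim, state_size)
        ssm_state_shape = (config.mamba_n_heads,
                           config.mamba_d_head,
                           config.mamba_d_state)
        return conv_state_shape, ssm_state_shape


# Toy usage with made-up config values:
cfg = SimpleNamespace(mamba_d_conv=4, mamba_expand=2, hidden_size=4096,
                      mamba_n_heads=128, mamba_d_head=64, mamba_d_state=128)
print(ExampleHybridForCausalLM.get_mamba_cache_shape(cfg))
# ((3, 8192), (128, 64, 128))
```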
Yeah, was thinking of something similar. I will have a go at it and open a new PR.
vllm/model_executor/models/config.py (Outdated)

```python
    use_mla=model_config.use_mla).page_size_bytes

# get mamba page size
mamba_page_size = MambaSpec(
```
As the logic for getting the MambaSpec here is different from that in gpu_model_runner, can you add a check in `gpu_model_runner.get_kv_cache_spec` to verify that the two are the same?
BTW, I think we don't need to check `FullAttentionSpec`, as exceptions will be raised if the page size doesn't match.
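A minimal sketch of the kind of consistency check being requested, under assumptions: the helper name and the `expected_mamba_page_size` argument (computed at config time) are illustrative, while `MambaSpec` is the real spec class discussed above.

```python
from vllm.v1.kv_cache_interface import MambaSpec


def verify_mamba_page_size(kv_cache_spec: dict,
                           expected_mamba_page_size: int) -> None:
    """Check that each MambaSpec built by the GPU model runner has the same
    page size as the one computed earlier from the config."""
    for layer_name, spec in kv_cache_spec.items():
        if isinstance(spec, MambaSpec):
            assert spec.page_size_bytes == expected_mamba_page_size, (
                f"Mamba page size mismatch for {layer_name}: "
                f"{spec.page_size_bytes} != {expected_mamba_page_size}")
```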
sure
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@DarkLight1337 @heheda12345 I've reverted the other changes and added a note about the CLI limitation for the hybrid mamba/attention models. I will continue refactoring the auto block size selection and open a new PR.
docs/models/supported_models.md (Outdated)

```diff
@@ -316,7 +316,7 @@ Specified using `--task generate`.
 | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
 | `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | |
+| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅ |
```
Can you make the checkmark icon consistent with the other entries in this table? Same for the other tables
The checkmark icon looks identical to me. Could you paste a screenshot of the difference you are seeing?
It's also visible in the git diff
Hmm let me just add suggestions in GitHub and you can commit the changes
must be some unicode weirdness
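For reference, the "weirdness" is that the two checkmarks in the diff above are distinct codepoint sequences even though they render almost identically; a quick way to inspect them:

```python
import unicodedata

# Print the codepoints behind each checkmark variant seen in the table diff.
for mark in ("✅", "✅︎"):
    print([f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in mark])
# ['U+2705 WHITE HEAVY CHECK MARK']
# ['U+2705 WHITE HEAVY CHECK MARK', 'U+FE0E VARIATION SELECTOR-15']
```

The existing table entries carry the variation selector (text presentation), which is why the plain emoji in the added row stood out.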
done, thanks for that
Unicode icon weirdness Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Merged
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Purpose
~~This PR automates the choice of attention block size when running hybrid (attention/SSM) models in V1. This simplifies things and unburdens the user from having to select the right block size. This change also enables running the models in V1 via the `vllm serve` CLI.~~
This PR just updates the V1 docs to reflect the current status of Mamba support on main. I will create another PR with the automated block size changes.
Test Plan
~~I've removed the fixing of the block size from the existing V1 hybrid tests.~~
Test Result
~~Should pass, but let's see the CI.~~
(Optional) Documentation Update
I've updated both the supported models and V1 guide documentation.
cc @heheda12345 @DarkLight1337 @tlrmchlsmth