[V1] [Doc] Update V1 docs for Mamba models #20499


Merged · 13 commits merged into vllm-project:main from the auto-block-size branch · Jul 9, 2025

Conversation

@tdoublep (Member) commented Jul 4, 2025

Purpose

This PR automates the choice of attention block size when running hybrid (attention/SSM) models in V1. This simplifies things and unburdens the user from having to select the right block size.

This change also enables running the models in V1 via the vllm serve CLI.

Update: this PR now just updates the V1 docs to reflect the current status of Mamba support on main. I will create another PR with the automated block-size changes.

Test Plan

I've removed the explicit block-size setting from the existing V1 hybrid tests.

Test Result

Should pass but let's see the CI.

(Optional) Documentation Update

I've updated both the supported models and V1 guide documentation.

cc @heheda12345 @DarkLight1337 @tlrmchlsmth


github-actions bot commented Jul 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @tdoublep, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an automated mechanism for determining the optimal attention block size for hybrid (attention/SSM) models within the V1 framework. This enhancement simplifies the user experience by removing the need for manual block size configuration and enables these models to be served via the vllm serve CLI.

Highlights

  • Automated Attention Block Size: Implemented an automated mechanism to determine the optimal attention block size for hybrid (attention/SSM) models, removing the need for manual configuration and simplifying user experience.
  • Hybrid Model V1 Support: Enabled full support for hybrid attention/SSM models (e.g., Bamba, Falcon-H1, GraniteMoEHybrid, Nemotron-H, Zamba2) within the V1 framework, including compatibility with the vllm serve CLI.
  • Dynamic KV Cache Alignment: Refactored the KV cache specification logic to dynamically align the page sizes for attention and Mamba layers in hybrid models. This involves automatically adjusting the attention block size and padding the Mamba page size to ensure exact alignment (a rough sketch of this alignment idea follows this list).
  • Documentation Updates: Updated the supported_models.md and v1_guide.md documentation to reflect the new V1 support status for Mamba and hybrid models, clarifying their functional status and specific requirements (e.g., eager mode, disabling prefix caching).
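
The alignment described in the highlights is easier to see with a concrete sketch. The helper below is purely illustrative (the function name, the 16-token alignment, and the example sizes are assumptions, not the PR's actual code): pick the smallest attention block size whose page is at least as large as the Mamba state page, then pad the Mamba page up to match.

```python
def align_hybrid_page_sizes(attn_page_size_per_token: int,
                            mamba_page_size: int,
                            block_alignment: int = 16) -> tuple[int, int]:
    """Illustrative only: choose an attention block size so one attention
    block is at least as large as one Mamba state page, then pad the Mamba
    page size so the two page sizes match exactly."""
    # Smallest number of tokens per attention block that covers the Mamba page.
    min_tokens = -(-mamba_page_size // attn_page_size_per_token)  # ceil division
    # Round up to a kernel-friendly multiple.
    attn_block_size = block_alignment * (-(-min_tokens // block_alignment))
    # Pad the Mamba page so both cache types use identical page sizes.
    mamba_page_size_padded = attn_block_size * attn_page_size_per_token
    return attn_block_size, mamba_page_size_padded


# Example: 2 KiB of KV cache per token vs. a 300 KiB Mamba state per layer.
print(align_hybrid_page_sizes(2 * 1024, 300 * 1024))  # -> (160, 327680)
```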

@mergify mergify bot added the documentation and v1 labels Jul 4, 2025
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request automates the choice of attention block size for hybrid models in V1, simplifying the user experience and enabling the models to run via the vllm serve CLI. The changes include updates to documentation and modifications to the V1 worker's GPU model runner to dynamically adjust attention block sizes. The code appears well-structured and addresses the objective effectively.

tdoublep added 3 commits July 4, 2025 20:13
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@heheda12345 (Collaborator) left a comment

cache_config.block_size is accessed here and there in vLLM, so we need to find a way to update these values safely. #19407 faces a similar issue with updating max_model_len. Maybe you can talk with her.

I don't have a good idea yet, but I'm happy to discuss it.

@tdoublep (Member, Author) commented Jul 7, 2025

> cache_config.block_size is accessed here and there in vLLM, so we need to find a way to update these values safely.

While debugging #20016 I looked into this quite a bit and was fairly confident that we don't read the block size out of the cache_config after we create the KVCacheSpec. I will check again, though, and think about it further.


mergify bot commented Jul 8, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tdoublep.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 8, 2025
@heheda12345 (Collaborator)

@tdoublep For example, self.block_size = vllm_config.cache_config.block_size (which appears in two different places). Though each point is solvable, I want a solution that protects people from writing incorrect code.

What about adding a user-defined function for each hybrid model to help detect the attention page size and mamba page size? For example, the input is the VllmConfig and the output is the set of KVCacheSpecs used in the model.

@tdoublep (Member, Author) commented Jul 8, 2025

> I want a solution that protects people from writing incorrect code.

Fully agree

> What about adding a user-defined function for each hybrid model to help detect the attention page size and mamba page size? For example, the input is the VllmConfig and the output is the set of KVCacheSpecs used in the model.

Not sure what you mean by user-defined here: the developer who contributes and maintains the modeling code, or the actual user who deploys a hybrid model with vLLM?

It seems like the "correct" place to update the vLLM config is in the check_and_update_config function from the platform class (e.g., here). Since hybrid models are only supported on GPU, I would propose we create a function as you suggest (although I think this function can be common for all hybrid models) and call it from check_and_update_config for CUDA. It would do something like:

# Sketch: called from check_and_update_config for the CUDA platform.
if is_hybrid(model_config):
    # Derive both values from the model/cache config in one place.
    attn_block_size, mamba_page_size_padded = parse_vllm_config_for_hybrid(vllm_config)
    # Any later reader of cache_config picks up the auto-selected values.
    cache_config.block_size = attn_block_size
    cache_config.mamba_page_size = mamba_page_size_padded

And then the changes within the GPU model runner would be minimal, and we would ensure that any code that reads the attention block size from the cache config is still correct.

WDYT?

@heheda12345 (Collaborator)

Sorry for the confusion. I mean "the developer who contributes and maintains the modeling code", but I would be happy if you can implement a general parse_vllm_config_for_hybrid for all hybrid mamba models. BTW, I prefer to call these models hybrid_attention_mamba, as sometimes we also use "hybrid" for full attention + sliding window attention.

I think it is not GPU-specific, as other backends can use it in the future. I prefer to update the config in try_verify_and_update_config, but you can try to find a better place.

@tdoublep (Member, Author) commented Jul 8, 2025

OK! Sounds good, let me give it another shot and get back to you.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@mergify mergify bot removed the needs-rebase label Jul 8, 2025
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep (Member, Author) commented Jul 8, 2025

@DarkLight1337 @heheda12345 Should we merge this PR to get the doc updates in while I continue working on the config stuff? I don't really think it makes sense to tie the two together.

If you agree, I would revert the non-doc changes and also add a note that hybrid models don't work via CLI yet.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep (Member, Author) commented Jul 8, 2025

@heheda12345 I took another go at the config update. Now everything happens from within verify_and_update_config. There are now no changes to the GPU model runner code at all.

Note that I can easily clean up the if/elif/else stuff when parsing the mamba config, but I want to get your thoughts on this approach first.
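
For readers following along, a very rough sketch of the shape this approach takes is below. The class, field, and helper names are hypothetical (not an excerpt from the PR): an architecture-keyed config hook rewrites the cache config before the model runner ever reads it, which is why no GPU model runner changes are needed.

```python
class HybridAttentionMambaModelConfig:
    """Illustrative sketch of a per-architecture verify-and-update hook."""

    @classmethod
    def verify_and_update_config(cls, vllm_config) -> None:
        cache_config = vllm_config.cache_config
        attn_block_size, mamba_page_size_padded = (
            cls._compute_aligned_page_sizes(vllm_config))
        # Any later reader of cache_config.block_size now sees the
        # automatically chosen attention block size.
        cache_config.block_size = attn_block_size
        cache_config.mamba_page_size_padded = mamba_page_size_padded

    @classmethod
    def _compute_aligned_page_sizes(cls, vllm_config):
        # e.g. computed with an alignment helper like the sketch earlier in
        # this thread; model-specific subclasses can override this.
        raise NotImplementedError


# Hypothetical architecture -> config-hook registry, mirroring the diff
# excerpt shown below.
MODELS_CONFIG_MAP = {
    "BambaForCausalLM": HybridAttentionMambaModelConfig,
}
```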

tdoublep added 2 commits July 8, 2025 20:57
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
tdoublep added 2 commits July 9, 2025 04:33
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
"BambaForCausalLM": HybridAttentionMambaModelConfig,
"GraniteMoeHybridForCausalLM": HybridAttentionMambaModelConfig,
"NemotronHForCausalLM": NemotronHModelConfig,
"Zamba2ForCausalLM": Zamba2ModelConfig,
Collaborator:

I don't want to require each new model to update this mapping. What about letting get_mamba_cache_shape be an interface that should be implemented by each hybrid model? The abstractions you made now can be useful utility functions to minimize code duplication when implementing this interface for each model.

class IsHybrid(Protocol):
    def get_mamba_cache_shape(cls, ...)
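
For illustration only, here is a sketch of what such an interface might look like, with a model opting in. The protocol shape, the hf_config attribute names, and the returned shapes are assumptions made for the example, not the interface that was ultimately adopted:

```python
from typing import Protocol


class IsHybrid(Protocol):
    """Hypothetical interface: each hybrid attention/mamba model reports its
    own mamba state shapes instead of registering in a central mapping."""

    @classmethod
    def get_mamba_cache_shape(
            cls, vllm_config) -> tuple[tuple[int, ...], tuple[int, ...]]:
        ...


class MyHybridForCausalLM:
    """Illustrative model implementing the interface above."""

    @classmethod
    def get_mamba_cache_shape(cls, vllm_config):
        hf_config = vllm_config.model_config.hf_config
        # Purely illustrative shapes; a real model derives these from its
        # conv kernel size, state dimension, and head layout.
        conv_state = (hf_config.hidden_size, hf_config.mamba_d_conv - 1)
        ssm_state = (hf_config.mamba_n_heads, hf_config.mamba_d_head,
                     hf_config.mamba_d_state)
        return conv_state, ssm_state
```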

Member (Author):

Yeah, I was thinking of something similar. I will have a go at it and open a new PR.

use_mla=model_config.use_mla).page_size_bytes

# get mamba page size
mamba_page_size = MambaSpec(
Collaborator:

As the logic for building the MambaSpec here is different from that in gpu_model_runner, can you add a check in gpu_model_runner.get_kv_cache_spec to verify that the two are the same?
BTW, I think we don't need to check FullAttentionSpec, as exceptions will be raised if the page size doesn't match.
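
A minimal sketch of the kind of cross-check being requested (the function name and the dict-of-specs shape are assumptions; the real spec classes live elsewhere in vLLM):

```python
def check_mamba_page_size(kv_cache_spec: dict, expected_page_size: int) -> None:
    """Sketch: verify that the Mamba pages built by the GPU model runner
    match the padded page size computed during config verification."""
    for layer_name, spec in kv_cache_spec.items():
        if type(spec).__name__ != "MambaSpec":
            # Attention specs already raise on their own if page sizes differ.
            continue
        if spec.page_size_bytes != expected_page_size:
            raise ValueError(
                f"Mamba page size mismatch for {layer_name}: "
                f"{spec.page_size_bytes} != {expected_page_size}")
```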

Member (Author):

sure

@heheda12345 (Collaborator)

> Should we merge this PR to get the doc updates in while I continue working on the config stuff?

I'm OK with that, but please remember to update the doc in your new PR.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tdoublep tdoublep changed the title from "[V1] [Doc] Automated choice of attention block size for hybrid models in V1" to "[V1] [Doc] Update V1 docs for Mamba models" Jul 9, 2025
@tdoublep (Member, Author) commented Jul 9, 2025

@DarkLight1337 @heheda12345 I've reverted the other changes and added a note about the CLI limitation for the hybrid mamba/attention models.

I will continue refactoring the auto block size selection and open a new PR.
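
For context, and not part of this PR's diff: at this point the hybrid attention/mamba models ran in V1 through the Python API rather than the serve CLI, under the constraints the updated docs describe (V1 engine, eager mode, prefix caching disabled). The snippet below is an illustrative sketch under those assumptions; the model name is one of the examples from the supported-models table.

```python
import os

# Select the V1 engine explicitly (an assumption about the setup at the time).
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-ai-platform/Bamba-9B",   # example hybrid attention/mamba model
    enforce_eager=True,                 # per the docs, run in eager mode
    enable_prefix_caching=False,        # per the docs, disable prefix caching
)
out = llm.generate(["The capital of Switzerland is"],
                   SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```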

@@ -316,7 +316,7 @@ Specified using `--task generate`.
| `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
| `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | |
| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | |
@DarkLight1337 (Member) commented Jul 9, 2025

Can you make the checkmark icon consistent with the other entries in this table? Same for the other tables

Member (Author):

The checkmark icon looks identical to me. Could you paste a screenshot of the difference you are seeing?

Member:

[screenshot showing the two different checkmark icons]

Member:

It's also visible in the git diff

Member (Author):

lol, wth. This is what git diff (and the table) shows for me:
[screenshot of the author's git diff, where the icons look identical]

Member:

Hmm, let me just add suggestions in GitHub and you can commit the changes.

Member (Author):

must be some unicode weirdness

Member (Author):

done, thanks for that

Unicode icon weirdness

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@vllm-bot vllm-bot merged commit 5358cce into vllm-project:main Jul 9, 2025
5 of 9 checks passed
@DarkLight1337 (Member)

Merged

@tdoublep tdoublep deleted the auto-block-size branch July 9, 2025 08:03
ant-yy pushed a commit to ant-yy/vllm that referenced this pull request Jul 9, 2025
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
tdoublep added a commit to tdoublep/vllm that referenced this pull request Jul 11, 2025
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Chen-zexi pushed a commit to Chen-zexi/vllm that referenced this pull request Jul 13, 2025
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
patrickvonplaten pushed a commit to patrickvonplaten/vllm that referenced this pull request Jul 15, 2025
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Labels
documentation, v1
Projects
None yet

4 participants