[V1] [Doc] Update V1 docs for Mamba models #20499
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
Summary of Changes
Hello @tdoublep, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an automated mechanism for determining the optimal attention block size for hybrid (attention/SSM) models within the V1 framework. This enhancement simplifies the user experience by removing the need for manual block size configuration and enables these models to be served via the `vllm serve` CLI.
Highlights
- Automated Attention Block Size: Implemented an automated mechanism to determine the optimal attention block size for hybrid (attention/SSM) models, removing the need for manual configuration and simplifying the user experience.
- Hybrid Model V1 Support: Enabled full support for hybrid attention/SSM models (e.g., Bamba, Falcon-H1, GraniteMoEHybrid, Nemotron-H, Zamba2) within the V1 framework, including compatibility with the `vllm serve` CLI.
- Dynamic KV Cache Alignment: Refactored the KV cache specification logic to dynamically align the page sizes for attention and Mamba layers in hybrid models. This involves automatically adjusting the attention block size and padding the Mamba page size to ensure exact alignment (see the sketch after this list).
- Documentation Updates: Updated the `supported_models.md` and `v1_guide.md` documentation to reflect the new V1 support status for Mamba and hybrid models, clarifying their functional status and specific requirements (e.g., eager mode, disabling prefix caching).
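To make the alignment in the third highlight concrete, here is a minimal sketch of the kind of computation involved. It is illustrative only: the function name, the 16-token block granularity, and the example byte counts are assumptions, not the PR's actual code.

```python
import math


def align_page_sizes(attn_bytes_per_token: int,
                     mamba_page_bytes: int) -> tuple[int, int]:
    """Pick an attention block size (multiple of 16 tokens) whose KV-cache
    page is at least as large as the Mamba state page, then pad the Mamba
    page up so that both page sizes match exactly."""
    attn_block_size = 16 * math.ceil(
        mamba_page_bytes / (16 * attn_bytes_per_token))
    mamba_page_bytes_padded = attn_block_size * attn_bytes_per_token
    return attn_block_size, mamba_page_bytes_padded


# Example: if one attention token takes 256 bytes of KV cache and the Mamba
# state page is 41,000 bytes, the block size becomes 176 tokens and the
# padded Mamba page 45,056 bytes.
print(align_page_sizes(256, 41_000))  # (176, 45056)
```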
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Code Review
This pull request automates the choice of attention block size for hybrid models in V1, simplifying the user experience and enabling the models to run via the `vllm serve` CLI. The changes include updates to documentation and modifications to the V1 worker's GPU model runner to dynamically adjust attention block sizes. The code appears well-structured and addresses the objective effectively.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
`cache_config.block_size` is accessed here and there in vLLM, so we need to find a way to update this value safely. #19407 faces a similar issue with updating `max_model_len`. Maybe you can talk with her.
I don't have a good idea yet, but I'm happy to discuss it.
While debugging #20016 I looked into it quite a bit and was pretty confident that we don't read the block size out of the `cache_config` after we create the `KVCacheSpec`. I will check again though and think about it further.
This pull request has merge conflicts that must be resolved before it can be merged.
@tdoublep For example: vllm/vllm/v1/spec_decode/eagle.py, line 43 in 3eb4ad5.
What about adding a user-defined function for each hybrid model to help detect the attention page size and mamba page size? For example, the input is `VllmConfig` and the output is the set of `KVCacheSpec` used in the model.
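As a rough illustration of that idea, the per-model hook could look something like the protocol below. This is a sketch under assumptions: the protocol and method names are hypothetical and not an existing vLLM interface, while `VllmConfig` and `KVCacheSpec` are the real classes mentioned in the comment.

```python
from typing import Protocol

from vllm.config import VllmConfig
from vllm.v1.kv_cache_interface import KVCacheSpec


class HasKVCacheSpecs(Protocol):
    """Hypothetical per-model hook: the model describes its own KV cache
    layout instead of the runner inferring it layer by layer."""

    @classmethod
    def get_kv_cache_specs(cls, vllm_config: VllmConfig) -> dict[str, KVCacheSpec]:
        """Return a KVCacheSpec for each cache-bearing layer of the model."""
        ...
```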
Fully agree.
Not sure what you mean by user-defined here - the developer who contributes and maintains the modeling code, or the actual user who deploys a hybrid model with vLLM? It seems like the "correct" place to update the vLLM config would be something like:

```python
if is_hybrid(model_config):
    attn_block_size, mamba_page_size_padded = parse_vllm_config_for_hybrid(vllm_config)
    cache_config.block_size = attn_block_size
    cache_config.mamba_page_size = mamba_page_size_padded
```

And then the changes within the GPU model runner would be minimal, and we would ensure that any code that reads the attention block size from the cache config is still correct. WDYT?
Sorry for the confusion. I mean "the developer who contributes and maintains the modeling code", but I will be happy if you can implement a general solution. I think it is not GPU-specific, as other backends can use it in the future. I prefer to update the config here: Line 4778 in 71d1d75.
OK! Sounds good, let me give it another shot and get back to you.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@DarkLight1337 @heheda12345 Should we merge this PR to get the doc updates in while I continue working on the config stuff? I don't really think it makes sense to tie them together. If you agree, I would revert the non-doc changes and also add a note that hybrid models don't work via the CLI yet.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@heheda12345 I took another go at the config update. Now everything happens from within the config. Note: I can clean up the if/elif/else stuff when parsing the mamba config easily, but I want to get your thoughts on this approach.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
vllm/model_executor/models/config.py (Outdated)

```python
"BambaForCausalLM": HybridAttentionMambaModelConfig,
"GraniteMoeHybridForCausalLM": HybridAttentionMambaModelConfig,
"NemotronHForCausalLM": NemotronHModelConfig,
"Zamba2ForCausalLM": Zamba2ModelConfig,
```
I don't want to require each new model to update this page. What about letting `get_mamba_cache_shape` be an interface that should be implemented by each hybrid model? The abstractions you made now can become useful utility functions to minimize code duplication when implementing this interface for each model.

```python
class IsHybrid(Protocol):
    def get_mamba_cache_shape(cls, ...)
```
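As an illustration of how a model might satisfy such a protocol, here is a toy sketch; the config attribute names and the returned state shapes are assumptions for illustration only, not any real model's fields.

```python
from types import SimpleNamespace


class ExampleHybridForCausalLM:
    """Toy model class showing the shape of the proposed interface."""

    @classmethod
    def get_mamba_cache_shape(cls, config):
        # Per-layer conv state: (kernel_size - 1, inner_dim)
        conv_state_shape = (config.mamba_d_conv - 1,
                            config.mamba_expand * config.hidden_size)
        # Per-layer SSM state: (num_heads, head_dim, state_size)
        ssm_state_shape = (config.mamba_n_heads,
                           config.mamba_d_head,
                           config.mamba_d_state)
        return conv_state_shape, ssm_state_shape


# Toy usage with made-up config values:
cfg = SimpleNamespace(mamba_d_conv=4, mamba_expand=2, hidden_size=4096,
                      mamba_n_heads=128, mamba_d_head=64, mamba_d_state=128)
print(ExampleHybridForCausalLM.get_mamba_cache_shape(cfg))
# ((3, 8192), (128, 64, 128))
```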
Yeah, was thinking of something similar. I will have a go at it and open a new PR.
vllm/model_executor/models/config.py (Outdated)

```python
    use_mla=model_config.use_mla).page_size_bytes

# get mamba page size
mamba_page_size = MambaSpec(
```
As the logic for getting the MambaSpec here is different from that in gpu_model_runner, can you add a check in `gpu_model_runner.get_kv_cache_spec` to verify that the two are the same?
BTW, I think we don't need to check `FullAttentionSpec`, as exceptions will be raised if the page size doesn't match.
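A minimal sketch of the kind of consistency check being requested, under assumptions: the helper name and the `expected_mamba_page_size` argument (computed at config time) are illustrative, while `MambaSpec` is the real spec class discussed above.

```python
from vllm.v1.kv_cache_interface import MambaSpec


def verify_mamba_page_size(kv_cache_spec: dict,
                           expected_mamba_page_size: int) -> None:
    """Check that each MambaSpec built by the GPU model runner has the same
    page size as the one computed earlier from the config."""
    for layer_name, spec in kv_cache_spec.items():
        if isinstance(spec, MambaSpec):
            assert spec.page_size_bytes == expected_mamba_page_size, (
                f"Mamba page size mismatch for {layer_name}: "
                f"{spec.page_size_bytes} != {expected_mamba_page_size}")
```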
sure
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@DarkLight1337 @heheda12345 I've reverted the other changes and added a note about the CLI limitation for the hybrid mamba/attention models. I will continue refactoring the auto block size selection and open a new PR.
docs/models/supported_models.md (Outdated)

```diff
@@ -316,7 +316,7 @@ Specified using `--task generate`.
 | `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | | ✅︎ | ✅︎ |
 | `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | |
+| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | ✅︎ | ✅︎ | ✅ |
```
Can you make the checkmark icon consistent with the other entries in this table? Same for the other tables
The checkmark icon looks identical to me. Could you paste a screenshot of the difference you are seeing?
It's also visible in the git diff
Hmm let me just add suggestions in GitHub and you can commit the changes
must be some unicode weirdness
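For reference, the "weirdness" is that the two checkmarks in the diff above are distinct codepoint sequences even though they render almost identically; a quick way to inspect them:

```python
import unicodedata

# Print the codepoints behind each checkmark variant seen in the table diff.
for mark in ("✅", "✅︎"):
    print([f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in mark])
# ['U+2705 WHITE HEAVY CHECK MARK']
# ['U+2705 WHITE HEAVY CHECK MARK', 'U+FE0E VARIATION SELECTOR-15']
```

The existing table entries carry the variation selector (text presentation), which is why the plain emoji in the added row stood out.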
done, thanks for that
Unicode icon weirdness Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Merged
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Purpose
~~This PR automates the choice of attention block size when running hybrid (attention/SSM) models in V1. This simplifies things and unburdens the user from having to select the right block size. This change also enables running the models in V1 via the `vllm serve` CLI.~~
This PR just updates the V1 docs to reflect the current status of Mamba support on main. I will create another PR with the automated block size changes.
Test Plan
~~I've removed the fixing of the block size from the existing V1 hybrid tests.~~
Test Result
~~Should pass, but let's see the CI.~~
(Optional) Documentation Update
I've updated both the supported models and V1 guide documentation.
cc @heheda12345 @DarkLight1337 @tlrmchlsmth