[V1] Hybrid allocator without prefix caching #20661
Conversation
…s disabled Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Summary of Changes
Hello @nopperl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the v1 engine's HybridKVCacheCoordinator to support a broader range of hybrid models, particularly those that do not utilize prefix caching. By making the strict KV cache group verification conditional on prefix caching being enabled, it safely allows models with different attention mechanisms, such as SWA + Mamba hybrids, to run on the v1 engine. It also adds a safeguard to ensure cache-hit lookups are only attempted when caching is active.
Highlights
- Hybrid Model Support: This PR enables the v1 engine to support hybrid models combining Sliding Window Attention (SWA) and Mamba layers (e.g., PLaMo2), specifically when prefix caching is disabled. This addresses a limitation where the HybridKVCacheCoordinator previously required a strict configuration of attention groups.
- Conditional KV Cache Group Verification: The verify_and_split_kv_cache_groups method, which enforces the strict attention group requirements, is now only called during initialization if prefix caching (enable_caching) is enabled. This allows models that don't use prefix caching (like Mamba layers) to bypass these requirements.
- Prefix Caching Assertion: An assertion has been added to the find_longest_cache_hit method to ensure it is only called when prefix caching is enabled. This prevents logical errors if an attempt is made to find cache hits while caching is explicitly disabled. A simplified sketch of both changes follows this list.
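For quick orientation, here is a minimal sketch of the two changes above on a toy class. Only the method names verify_and_split_kv_cache_groups and find_longest_cache_hit and the enable_caching flag come from the PR; the class and bodies are simplifications, not the actual vLLM code in vllm/v1/core/kv_cache_coordinator.py:

```python
class SimplifiedHybridCoordinator:
    """Toy stand-in for the real HybridKVCacheCoordinator, for illustration only."""

    def __init__(self, enable_caching: bool):
        self.enable_caching = enable_caching
        if enable_caching:
            # The strict group verification is only needed when prefix
            # caching is active, so it is skipped otherwise.
            self.verify_and_split_kv_cache_groups()

    def verify_and_split_kv_cache_groups(self) -> None:
        # Placeholder for the real check that requires exactly one full
        # attention group plus one other attention group.
        pass

    def find_longest_cache_hit(self, block_hashes: list[int]) -> int:
        # Cache-hit lookup only makes sense when prefix caching is enabled.
        assert self.enable_caching, (
            "find_longest_cache_hit requires prefix caching to be enabled")
        return 0  # simplified: report no cached prefix
```

With this shape, SimplifiedHybridCoordinator(enable_caching=False) constructs without the group check and refuses cache-hit lookups, which mirrors the behavior described in the highlights.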
Code Review
This pull request enables the use of hybrid models like SWA + Mamba2 with the v1 engine by correctly bypassing prefix caching logic when it's disabled. The changes are minimal, logical, and well-protected by an assertion. I have one medium-severity suggestion to improve the maintainability of the HybridKVCacheCoordinator class by ensuring all instance attributes are always initialized.
vllm/v1/core/kv_cache_coordinator.py (outdated)

    if enable_caching:
        self.verify_and_split_kv_cache_groups()
This change correctly makes the setup for prefix caching conditional. However, it leads to a situation where a group of instance attributes (e.g., self.full_attention_group_ids, self.other_group_ids) is only defined if enable_caching is true.
This can make the class harder to maintain, as it may lead to AttributeError exceptions if other methods are added in the future that access these attributes without also checking self.enable_caching. While the new assertion in find_longest_cache_hit is a good safeguard for that specific method, the pattern of conditionally defining attributes can be fragile.
A more robust approach is to ensure all instance attributes are initialized in __init__, for example to None. This makes the object's structure consistent and improves readability and static analysis.
For example, you could add the following before this if statement:

    self.full_attention_group_ids = None
    self.other_group_ids = None
    # ... and so on for all attributes set in verify_and_split_kv_cache_groups

Since this change would be outside the current diff, I'm leaving it as a suggestion for you to consider for improving future maintainability.
I think you can implement a new KVCacheCoordinatorNoPrefixCache and use it when prefix caching is disabled. We need it for both 0 kv cache group and >2 kv cache groups.
BTW it seems that PLaMo2 doesn't use SWA. Is the model code correct?
Thanks for the suggestion @heheda12345 , will do that!
Could you explain what you mean by that? I was under the impression that it will use SWA if …
…aching is disabled" This reverts commit e5d6309. Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
…ot used Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
@heheda12345 I have introduced the KVCacheCoordinatorNoPrefixCache as suggested.
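For readers following along, here is a rough sketch of the idea behind such a coordinator. Only the name KVCacheCoordinatorNoPrefixCache comes from the discussion above; everything else is an illustrative assumption rather than the PR's actual implementation:

```python
class KVCacheCoordinatorNoPrefixCacheSketch:
    """Illustrative coordinator variant used when prefix caching is disabled.

    Because no prefixes are ever cached, it accepts any number of KV cache
    groups (0, 1, or more than 2) and never reports a cache hit.
    """

    def __init__(self, kv_cache_groups: list) -> None:
        self.kv_cache_groups = kv_cache_groups
        self.enable_caching = False
        # Note: no verify_and_split_kv_cache_groups() call here; without
        # prefix caching there is no need to split groups into a full
        # attention set and an "other" set.

    def find_longest_cache_hit(self, block_hashes: list) -> tuple[list, int]:
        # With prefix caching disabled there can never be a cached prefix.
        return [], 0
```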
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Thanks for your contribution!
@nopperl Please sync the branch with main to fix the async engine test.
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com> Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Essential Elements of an Effective PR Description Checklist
- supported_models.md and examples for a new model

Purpose
The HybridKVCacheCoordinator of the v1 engine currently only works with one full attention group plus one other attention group. This prevents certain models from being executed using the v1 engine (like PLaMo2, which is a sliding window + Mamba hybrid). However, since the strict full attention group requirement comes from the prefix caching feature, I think it is reasonable to sidestep the requirement if prefix caching is disabled. This will safely enable v1 engine support for hybrid Mamba models like PLaMo2, since Mamba layers currently do not support prefix caching anyway.

Related RFC: #17140
@heheda12345
Test Plan
tests/models/language/generation/test_hybrid.py
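For reference, these tests can typically be run locally with something like `pytest tests/models/language/generation/test_hybrid.py`; the exact pytest options and selected models may differ from the setup used for this PR.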
Test Result
Tests passed locally.
(Optional) Documentation Update
To run PLaMo2 using the v1 engine, see #20660 .
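As a rough illustration of offline usage, a sketch follows; the model name, environment variable, and arguments are assumptions based on this PR and general vLLM usage, so check #20660 for the authoritative instructions:

```python
import os

# Assumption: the v1 engine is selected via the VLLM_USE_V1 environment
# variable; set it before importing vllm.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Assumption: pfnet/plamo-2-1b is a PLaMo2 checkpoint. Prefix caching is
# disabled explicitly, since this PR only covers the no-prefix-caching path.
llm = LLM(
    model="pfnet/plamo-2-1b",
    trust_remote_code=True,
    enable_prefix_caching=False,
)

outputs = llm.generate(["The capital of Japan is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```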