[V1] Hybrid allocator without prefix caching #20661

Merged: 5 commits into vllm-project:main, Jul 13, 2025

Conversation

@nopperl (Contributor) commented Jul 9, 2025

Purpose

The HybridKVCacheCoordinator of the v1 engine currently requires exactly one full-attention group plus one other attention group. This prevents certain models from being executed with the v1 engine (like PLaMo2, which is a sliding window + Mamba hybrid). However, since the strict full-attention group requirement exists only to support prefix caching, I think it is reasonable to sidestep it when prefix caching is disabled. This safely enables v1 engine support for hybrid Mamba models like PLaMo2, since Mamba layers currently do not support prefix caching anyway.
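
For illustration, a minimal sketch of running such a hybrid model on the v1 engine with prefix caching disabled. The model id is a placeholder and the exact PLaMo2 invocation is covered in #20660; treat this as an assumption-laden example, not the documented setup.

    import os

    # Opt in to the v1 engine (not the default at the time of this PR).
    os.environ["VLLM_USE_V1"] = "1"

    from vllm import LLM, SamplingParams

    # Placeholder model id for a sliding window + Mamba hybrid; see #20660
    # for the actual PLaMo2 setup.
    llm = LLM(
        model="pfnet/plamo-2-1b",
        enable_prefix_caching=False,  # required: skips the strict group check
        trust_remote_code=True,
    )

    outputs = llm.generate(["The capital of Japan is"],
                           SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)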

Related RFC: #17140
@heheda12345

Test Plan

tests/models/language/generation/test_hybrid.py

Test Result

Tests passed locally.

(Optional) Documentation Update

To run PLaMo2 using the v1 engine, see #20660.

…s disabled

Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @nopperl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the v1 engine's HybridKVCacheCoordinator to support a broader range of hybrid models, particularly those that do not utilize prefix caching. By making the strict KV cache group verification conditional on prefix caching being enabled, it safely allows models with different attention mechanisms, such as SWA + Mamba hybrids, to run on the v1 engine. It also adds a safeguard to ensure cache hit lookups are only attempted when caching is active.

Highlights

  • Hybrid Model Support: This PR enables the v1 engine to support hybrid models like those combining Sliding Window Attention (SWA) and Mamba layers (e.g., PLaMo2), specifically when prefix caching is disabled. This addresses a limitation where the HybridKVCacheCoordinator previously required a strict configuration of attention groups.
  • Conditional KV Cache Group Verification: The verify_and_split_kv_cache_groups method, which enforces strict attention group requirements, is now only called during initialization if prefix caching (enable_caching) is enabled. This allows models that don't use prefix caching (like Mamba layers) to bypass these requirements.
  • Prefix Caching Assertion: An assertion has been added to the find_longest_cache_hit method to ensure it is only called when prefix caching is enabled. This prevents logical errors if an attempt is made to find cache hits when caching is explicitly disabled. Both changes are sketched below.
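
A self-contained sketch of the control flow described in the two bullets above. The class and method names come from this PR; the bodies are stubbed stand-ins, not vLLM's actual implementation.

    class HybridKVCacheCoordinator:
        """Stub illustrating the conditional verification added by this PR."""

        def __init__(self, enable_caching: bool) -> None:
            self.enable_caching = enable_caching
            # The strict group-layout check exists only to support prefix
            # caching, so it is skipped entirely when caching is disabled.
            if enable_caching:
                self.verify_and_split_kv_cache_groups()

        def verify_and_split_kv_cache_groups(self) -> None:
            # In vLLM this enforces one full-attention group plus one other
            # attention group; stubbed here with dummy group ids.
            self.full_attention_group_ids = [0]
            self.other_group_ids = [1]

        def find_longest_cache_hit(self, block_hashes: list) -> list:
            # Guard added by this PR: cache-hit lookup is meaningless when
            # prefix caching is disabled.
            assert self.enable_caching, "prefix caching is disabled"
            return []  # real lookup logic elided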

@mergify bot added the v1 label Jul 9, 2025

github-actions bot commented Jul 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment

Code Review

This pull request enables the use of hybrid models like SWA + Mamba2 with the v1 engine by correctly bypassing prefix caching logic when it's disabled. The changes are minimal, logical, and well-protected by an assertion. I have one medium-severity suggestion to improve the maintainability of the HybridKVCacheCoordinator class by ensuring all instance attributes are always initialized.

Comment on lines 224 to 225:

    if enable_caching:
        self.verify_and_split_kv_cache_groups()
@gemini-code-assist (bot) commented (severity: medium):

This change correctly makes the setup for prefix caching conditional. However, it means that several instance attributes (e.g., self.full_attention_group_ids, self.other_group_ids) are defined only when enable_caching is true.

This can make the class harder to maintain, as it may lead to AttributeError exceptions if other methods are added in the future that access these attributes without also checking self.enable_caching. While the new assertion in find_longest_cache_hit is a good safeguard for that specific method, the pattern of conditionally defining attributes can be fragile.

A more robust approach is to ensure all instance attributes are initialized in __init__, for example to None. This makes the object's structure consistent and improves readability and static analysis.

For example, you could add the following before this if statement:

    self.full_attention_group_ids = None
    self.other_group_ids = None
    # ... and so on for all attributes set in verify_and_split_kv_cache_groups

Since this change would be outside the current diff, I'm leaving it as a suggestion for you to consider for improving future maintainability.

@heheda12345 (Collaborator) left a comment

I think you can implement a new KVCacheCoordinatorNoPrefixCache and use it when prefix caching is disabled. We need it for both the 0 kv cache group and the >2 kv cache groups cases.
BTW it seems that PLaMo2 doesn't use SWA. Is the model code correct?

@nopperl (Contributor, Author) commented Jul 9, 2025

Thanks for the suggestion, @heheda12345, will do that!

> BTW it seems that PLaMo2 doesn't use SWA. Is the model code correct?

Could you explain what you mean by that? I was under the impression that it will use SWA if cache_config.sliding_window is set.

@heheda12345 (Collaborator) commented:

> Could you explain what you mean by that? I was under the impression that it will use SWA if cache_config.sliding_window is set.

Thanks. That's fine. Please ignore what I was asking.

@heheda12345 (Collaborator) commented:

@nopperl Can you sync with the author of #20577 to avoid duplication of effort? They need a coordinator without prefix caching too.

nopperl added 2 commits July 11, 2025 16:14
…aching is disabled"

This reverts commit e5d6309.

Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
…ot used

Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
@nopperl (Contributor, Author) commented Jul 11, 2025

@heheda12345 I have introduced KVCacheCoordinatorNoPrefixCache (using KVCacheCoordinator as the base class).
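
For readers following along, a rough sketch of what such a subclass can look like. The import path and the base class's interface are assumptions for illustration; consult the vLLM source for the real implementation.

    # Import path assumed for illustration.
    from vllm.v1.core.kv_cache_coordinator import KVCacheCoordinator

    class KVCacheCoordinatorNoPrefixCache(KVCacheCoordinator):
        """Coordinator for the no-prefix-caching case.

        Because it never performs prefix-cache lookups, it does not need
        the strict group layout and works with any number of KV cache
        groups (including 0 and >2).
        """

        def find_longest_cache_hit(self, block_hashes, max_cache_hit_length):
            # Hypothetical guard: this path must never be reached when
            # prefix caching is disabled.
            raise NotImplementedError(
                "cache-hit lookup is unsupported without prefix caching")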

Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
@heheda12345 (Collaborator) left a comment

Thanks for your contribution!

@heheda12345 enabled auto-merge (squash) July 13, 2025 12:48
@github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 13, 2025
@heheda12345 changed the title from "[V1] Allow SWA + Mamba2 models without prefix caching" to "[V1] Hybrid allocator without prefix caching" Jul 13, 2025
@heheda12345 (Collaborator) commented:

@nopperl Please sync the branch with main to fix the async engine test.

Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
@heheda12345 merged commit 4bbfc36 into vllm-project:main Jul 13, 2025
63 of 64 checks passed
b8zhong pushed a commit to bzhng-development/vllm that referenced this pull request Jul 14, 2025
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1