[V1] Hybrid allocator without prefix caching #20661
Conversation
…s disabled Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Summary of Changes
Hello @nopperl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the v1 engine's HybridKVCacheCoordinator to support a broader range of hybrid models, particularly those that do not utilize prefix caching. By making the strict KV cache group verification conditional on prefix caching being enabled, it safely allows models with different attention mechanisms, such as SWA + Mamba hybrids, to run on the v1 engine. It also adds a safeguard to ensure cache-hit lookups are only attempted when caching is active.
Highlights
- Hybrid Model Support: This PR enables the v1 engine to support hybrid models combining Sliding Window Attention (SWA) and Mamba layers (e.g., PLaMo2), specifically when prefix caching is disabled. This addresses a limitation where the HybridKVCacheCoordinator previously required a strict configuration of attention groups.
- Conditional KV Cache Group Verification: The verify_and_split_kv_cache_groups method, which enforces the strict attention group requirements, is now only called during initialization if prefix caching (enable_caching) is enabled. This allows models that don't use prefix caching (like Mamba layers) to bypass these requirements.
- Prefix Caching Assertion: An assertion has been added to the find_longest_cache_hit method to ensure it is only called when prefix caching is enabled. This prevents logical errors if an attempt is made to find cache hits while caching is explicitly disabled. A simplified sketch of both changes follows this list.
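For quick orientation, here is a minimal sketch of the two changes above on a toy class. Only the method names verify_and_split_kv_cache_groups and find_longest_cache_hit and the enable_caching flag come from the PR; the class and bodies are simplifications, not the actual vLLM code in vllm/v1/core/kv_cache_coordinator.py:

```python
class SimplifiedHybridCoordinator:
    """Toy stand-in for the real HybridKVCacheCoordinator, for illustration only."""

    def __init__(self, enable_caching: bool):
        self.enable_caching = enable_caching
        if enable_caching:
            # The strict group verification is only needed when prefix
            # caching is active, so it is skipped otherwise.
            self.verify_and_split_kv_cache_groups()

    def verify_and_split_kv_cache_groups(self) -> None:
        # Placeholder for the real check that requires exactly one full
        # attention group plus one other attention group.
        pass

    def find_longest_cache_hit(self, block_hashes: list[int]) -> int:
        # Cache-hit lookup only makes sense when prefix caching is enabled.
        assert self.enable_caching, (
            "find_longest_cache_hit requires prefix caching to be enabled")
        return 0  # simplified: report no cached prefix
```

With this shape, SimplifiedHybridCoordinator(enable_caching=False) constructs without the group check and refuses cache-hit lookups, which mirrors the behavior described in the highlights.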
Code Review
This pull request enables the use of hybrid models like SWA + Mamba2 with the v1 engine by correctly bypassing prefix caching logic when it's disabled. The changes are minimal, logical, and well-protected by an assertion. I have one medium-severity suggestion to improve the maintainability of the HybridKVCacheCoordinator class by ensuring all instance attributes are always initialized.
vllm/v1/core/kv_cache_coordinator.py (outdated)

    if enable_caching:
        self.verify_and_split_kv_cache_groups()
This change correctly makes the setup for prefix caching conditional. However, it leads to a situation where a group of instance attributes (e.g., self.full_attention_group_ids, self.other_group_ids) is only defined if enable_caching is true.
This can make the class harder to maintain, as it may lead to AttributeError exceptions if other methods are added in the future that access these attributes without also checking self.enable_caching. While the new assertion in find_longest_cache_hit is a good safeguard for that specific method, the pattern of conditionally defining attributes can be fragile.
A more robust approach is to ensure all instance attributes are initialized in __init__, for example to None. This makes the object's structure consistent and improves readability and static analysis.
For example, you could add the following before this if statement:

    self.full_attention_group_ids = None
    self.other_group_ids = None
    # ... and so on for all attributes set in verify_and_split_kv_cache_groups

Since this change would be outside the current diff, I'm leaving it as a suggestion for you to consider for improving future maintainability.
I think you can implement a new KVCacheCoordinatorNoPrefixCache and use it when prefix caching is disabled. We need it for both 0 kv cache group and >2 kv cache groups.
BTW it seems that PLaMo2 doesn't use SWA. Is the model code correct?
Thanks for the suggestion @heheda12345 , will do that!
Could you explain what you mean by that? I was under the impression that it will use SWA if …
…aching is disabled" This reverts commit e5d6309. Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
…ot used Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
@heheda12345 I have introduced the KVCacheCoordinatorNoPrefixCache as suggested.
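For readers following along, here is a rough sketch of the idea behind such a coordinator. Only the name KVCacheCoordinatorNoPrefixCache comes from the discussion above; everything else is an illustrative assumption rather than the PR's actual implementation:

```python
class KVCacheCoordinatorNoPrefixCacheSketch:
    """Illustrative coordinator variant used when prefix caching is disabled.

    Because no prefixes are ever cached, it accepts any number of KV cache
    groups (0, 1, or more than 2) and never reports a cache hit.
    """

    def __init__(self, kv_cache_groups: list) -> None:
        self.kv_cache_groups = kv_cache_groups
        self.enable_caching = False
        # Note: no verify_and_split_kv_cache_groups() call here; without
        # prefix caching there is no need to split groups into a full
        # attention set and an "other" set.

    def find_longest_cache_hit(self, block_hashes: list) -> tuple[list, int]:
        # With prefix caching disabled there can never be a cached prefix.
        return [], 0
```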
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Thanks for your contribution!
@nopperl Please sync the branch with main to fix the async engine test.
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com> Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Essential Elements of an Effective PR Description Checklist
- supported_models.md and examples for a new model

Purpose
The HybridKVCacheCoordinator of the v1 engine currently only works with one full attention group plus one other attention group. This prevents certain models from being executed using the v1 engine (like PLaMo2, which is a sliding window + Mamba hybrid). However, since the strict full attention group requirement comes from the prefix caching feature, I think it is reasonable to sidestep the requirement if prefix caching is disabled. This will safely enable v1 engine support for hybrid Mamba models like PLaMo2, since Mamba layers currently do not support prefix caching anyway.

Related RFC: #17140
@heheda12345
Test Plan
tests/models/language/generation/test_hybrid.py
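For reference, these tests can typically be run locally with something like `pytest tests/models/language/generation/test_hybrid.py`; the exact pytest options and selected models may differ from the setup used for this PR.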
Test Result
Tests passed locally.
(Optional) Documentation Update
To run PLaMo2 using the v1 engine, see #20660 .
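As a rough illustration of offline usage, a sketch follows; the model name, environment variable, and arguments are assumptions based on this PR and general vLLM usage, so check #20660 for the authoritative instructions:

```python
import os

# Assumption: the v1 engine is selected via the VLLM_USE_V1 environment
# variable; set it before importing vllm.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Assumption: pfnet/plamo-2-1b is a PLaMo2 checkpoint. Prefix caching is
# disabled explicitly, since this PR only covers the no-prefix-caching path.
llm = LLM(
    model="pfnet/plamo-2-1b",
    trust_remote_code=True,
    enable_prefix_caching=False,
)

outputs = llm.generate(["The capital of Japan is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```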