[Attention] Make local attention backend agnostic #21093
Conversation
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

fix
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Code Review
This pull request refactors the local attention mechanism to be backend-agnostic, which is a great improvement for code maintainability and extensibility. The approach of centralizing the virtual batch creation logic in vllm/v1/attention/backends/utils.py and introducing a ChunkedLocalAttentionSpec is well-designed.

I've found one critical issue in the implementation of the new make_local_attention_virtual_batches function, where the returned CommonAttentionMetadata has inconsistent dimensions, which would likely cause runtime errors or incorrect attention calculations. I've provided a suggestion to fix this. Once that's addressed, the changes should be in good shape.
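For readers following the discussion, here is a minimal, self-contained sketch of the general idea behind splitting requests into per-chunk "virtual batches". It is not the vLLM implementation: SimpleAttentionMetadata and its fields are assumptions standing in for CommonAttentionMetadata, and the helper name is only illustrative.

```python
# Hypothetical sketch of chunked-local "virtual batches": each request's query
# tokens are split at chunk boundaries so that every resulting virtual batch
# only attends to keys inside its own chunk. SimpleAttentionMetadata and its
# fields are stand-ins, not vLLM's CommonAttentionMetadata.
from dataclasses import dataclass

import numpy as np


@dataclass
class SimpleAttentionMetadata:
    query_start_loc: np.ndarray  # cumulative query lengths, shape (num_reqs + 1,)
    seq_lens: np.ndarray         # total (context + query) length per request


def make_local_attention_virtual_batches_sketch(
        chunk_size: int,
        meta: SimpleAttentionMetadata) -> SimpleAttentionMetadata:
    new_query_lens: list[int] = []
    new_seq_lens: list[int] = []
    for i in range(len(meta.seq_lens)):
        q_len = int(meta.query_start_loc[i + 1] - meta.query_start_loc[i])
        seq_len = int(meta.seq_lens[i])
        pos = seq_len - q_len  # position of this request's first query token
        while pos < seq_len:
            chunk_start = (pos // chunk_size) * chunk_size
            chunk_end = min(chunk_start + chunk_size, seq_len)
            new_query_lens.append(chunk_end - pos)        # query tokens in chunk
            new_seq_lens.append(chunk_end - chunk_start)  # keys visible to them
            pos = chunk_end
    query_start_loc = np.concatenate(([0], np.cumsum(new_query_lens)))
    return SimpleAttentionMetadata(
        query_start_loc=query_start_loc.astype(np.int32),
        seq_lens=np.asarray(new_seq_lens, dtype=np.int32))
```

For example, with chunk_size=4, a request with seq_len=10 and 6 query tokens becomes two virtual batches of 4 and 2 query tokens. Note that both returned arrays describe the same number of virtual batches (len(query_start_loc) == len(seq_lens) + 1), which is exactly the dimensional consistency the review comment is concerned with.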
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
What a satisfying clean up! LGTM
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com>
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
```python
                        ChunkedLocalAttentionSpec):
    common_attn_metadata = make_local_attention_virtual_batches(
        kv_cache_group_spec.kv_cache_spec.attention_chunk_size,
        common_attn_metadata, self.cache_config.block_size)
```
@LucasWilkinson The hybrid kv cache manager is not compatible with kv connectors and kv events. When those are enabled, we fall back to a single kv cache group with FullAttentionSpec and set FullAttentionSpec.attention_chunk_size. I think there will be a bug here. You can test it by launching the model with --disable-hybrid-kv-cache-manager.
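To illustrate the concern, here is a small sketch using assumed stand-in spec classes (not the real vLLM definitions): an isinstance check on ChunkedLocalAttentionSpec never fires on the fallback path, where the single remaining group uses FullAttentionSpec with attention_chunk_size set, so the virtual-batch rewrite would be skipped silently.

```python
# Illustrative only: FullAttentionSpec / ChunkedLocalAttentionSpec below are
# simplified stand-ins for the vLLM kv cache specs.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FullAttentionSpec:
    attention_chunk_size: Optional[int] = None


@dataclass
class ChunkedLocalAttentionSpec:
    attention_chunk_size: int


def needs_local_attention_rewrite(spec) -> bool:
    # The check from the quoted diff: misses the fallback case.
    return isinstance(spec, ChunkedLocalAttentionSpec)


def needs_local_attention_rewrite_fixed(spec) -> bool:
    # Also covers a FullAttentionSpec carrying attention_chunk_size, which is
    # what the fallback path produces according to the comment above.
    return getattr(spec, "attention_chunk_size", None) is not None


# Fallback path: one group, full-attention spec, chunk size still set.
fallback_spec = FullAttentionSpec(attention_chunk_size=8192)
assert not needs_local_attention_rewrite(fallback_spec)   # rewrite skipped
assert needs_local_attention_rewrite_fixed(fallback_spec)  # rewrite applied
```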
Thanks for the heads up! Back in the office tomorrow; will take a look 👍
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Make local attention backend agnostic now that #20466 has landed, so we can turn on llama4 iRoPE for FlashInfer on Blackwell.
Test Plan
Ruler eval
Test Result
This PR
Main
(Optional) Documentation Update