Support encoder-only models without KV-Cache #21270
base: main
Conversation
Add support for encoder models such as BERT, which don't support a KV cache due to their non-causal attention. Since the KV Cache Spec is used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the layers and adds a function to build the attention metadata. This PR combines elements of PRs vllm-project#21088 and vllm-project#19988.

**Summary of changes:**

**Flash Attention Backend:**
- Implement encoder self-attention support without using the KV cache

**Scheduler:**
- Disable chunked prefill for models without a KV cache

**GPU Model Runner:**
- Implement encoder-only attention metadata building for self-attention

**Related to:**
- V0 deprecation: vllm-project#18571
- 2025 Q3 roadmap: vllm-project#20336

Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
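To make the approach concrete, here is a minimal, self-contained sketch of the idea. The names (`AttentionLayer`, `EncoderOnlyAttentionMetadata`, `build_encoder_only_attn_metadata`) are illustrative stand-ins, not the PR's actual vLLM classes: because there is no KV cache, the per-layer metadata is derived only from the model's attention layers and the batch's query start locations, and the same object is shared by every encoder-only layer.

```python
from dataclasses import dataclass
from enum import Enum

import torch


class AttentionType(Enum):
    DECODER = "decoder"
    ENCODER_ONLY = "encoder_only"


@dataclass
class AttentionLayer:
    # Stand-in for an attention module discovered by walking the model.
    attn_type: AttentionType


@dataclass
class EncoderOnlyAttentionMetadata:
    # No block tables or slot mappings: there is no KV cache to index into.
    num_actual_tokens: int
    max_query_len: int
    query_start_loc: torch.Tensor  # cumulative token offsets, shape [num_seqs + 1]


def build_encoder_only_attn_metadata(
    attention_layers: dict[str, AttentionLayer],
    query_start_loc: torch.Tensor,
    max_query_len: int,
) -> dict[str, EncoderOnlyAttentionMetadata]:
    """Build one metadata object and share it across all encoder-only layers."""
    metadata = EncoderOnlyAttentionMetadata(
        num_actual_tokens=int(query_start_loc[-1].item()),
        max_query_len=max_query_len,
        query_start_loc=query_start_loc,
    )
    attn_metadata: dict[str, EncoderOnlyAttentionMetadata] = {}
    for layer_name, attn_module in attention_layers.items():
        if attn_module.attn_type == AttentionType.ENCODER_ONLY:
            attn_metadata[layer_name] = metadata
    return attn_metadata


# Example: two encoder-only layers, a batch of two prompts (7 and 5 tokens).
layers = {
    "model.layers.0.attn": AttentionLayer(AttentionType.ENCODER_ONLY),
    "model.layers.1.attn": AttentionLayer(AttentionType.ENCODER_ONLY),
}
qsl = torch.tensor([0, 7, 12], dtype=torch.int32)
print(build_encoder_only_attn_metadata(layers, qsl, max_query_len=7))
```

The scheduler change follows from the same constraint: without a KV cache there is nothing to carry partial prompt state between steps, so chunked prefill is disabled for these models.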
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Code Review
This pull request introduces support for encoder-only models without a KV cache. The changes are well-structured and cover the necessary modifications in the attention backend, scheduler, and model runner. I have identified areas where the implementation's strictness could limit future extensibility. Specifically, the error handling and assertions in GPUModelRunner are too restrictive and should be made more flexible to accommodate potential future model architectures.
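As a hypothetical illustration of that point (not the PR's actual `GPUModelRunner` code), compare a bare assertion over the encountered attention types with an explicit check that names what is unsupported and leaves room for future architectures:

```python
from enum import Enum


class AttentionType(Enum):
    DECODER = "decoder"
    ENCODER_ONLY = "encoder_only"
    ENCODER_DECODER = "encoder_decoder"


def check_supported_attention(attn_types: set[AttentionType]) -> None:
    # Overly strict: assumes every model is either a pure decoder or a pure
    # encoder, and fails with an uninformative AssertionError.
    # assert attn_types <= {AttentionType.DECODER, AttentionType.ENCODER_ONLY}

    # More flexible: reject unsupported combinations with a clear message,
    # so new architectures can be added without hunting down bare asserts.
    supported = {AttentionType.DECODER, AttentionType.ENCODER_ONLY}
    unsupported = attn_types - supported
    if unsupported:
        raise NotImplementedError(
            f"Attention types {sorted(t.value for t in unsupported)} are not "
            "supported by this model runner yet."
        )
```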
@DarkLight1337 this PR should enable support for all BERT models except the classifier models that require token type IDs. That can be left to a future PR, since there are several implementation alternatives. Because the KV cache is disabled in this PR, it requires far fewer changes than PR #19988.
cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.
This pull request has merge conflicts that must be resolved before it can be merged.
My comments were addressed, but it needs review from others since I'm a co-author:

> cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.
```python
            self.vllm_config, Attention)
        for layer_name, attn_module in attention_layers.items():
            if attn_module.attn_type == AttentionType.ENCODER_ONLY:
                attn_metadata[layer_name] = encoder_attn_metdata
```
Suggested change:

```diff
-                attn_metadata[layer_name] = encoder_attn_metdata
+                attn_metadata[layer_name] = encoder_attn_metadata
```
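A tiny, purely illustrative snippet (not vLLM code) showing why the one-character fix matters: the misspelled name was never defined, so the assignment would raise a `NameError` as soon as the first encoder-only layer is processed.

```python
# Stand-in for the metadata object that was actually built earlier.
encoder_attn_metadata = {"max_query_len": 7}
attn_metadata = {}

try:
    attn_metadata["layers.0.attn"] = encoder_attn_metdata  # misspelled name
except NameError as err:
    print(f"Typo reproduced: {err}")

attn_metadata["layers.0.attn"] = encoder_attn_metadata  # corrected spelling
print(attn_metadata)
```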
This PR is co-authored with @russellb. It borrows all of the encoder-only attention code from his PR #21088 but leaves out the cross-encoder and encoder attention.
cc: @DarkLight1337