
Support encoder-only models without KV-Cache #21270


Open
wants to merge 9 commits into base: main

Conversation

@maxdebayser (Contributor) commented Jul 20, 2025

Add support for encoder models such as BERT, which cannot use a KV cache because of their non-causal attention. Since the KV Cache Spec is used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the layers and adds a function to build the attention metadata.

This PR combines elements of PRs #21088 and #19988.

Summary of changes:

Flash Attention Backend:

  • Implement encoder self-attention support without using KV cache

Scheduler:

  • Disable chunked prefill for models without KV cache

GPU Model Runner:

  • Implement encoder-only attention metadata building for self-attention (a simplified sketch follows this list)
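
As a rough illustration of the approach (not the actual vLLM code), the sketch below shows the core idea: metadata for encoder-only, non-causal self-attention is built directly from the scheduled sequence lengths rather than from the KV Cache Spec, and the same metadata object is assigned to every encoder-only attention layer. All names here (`AttentionType`, `EncoderOnlyAttentionMetadata`, `build_encoder_attn_metadata`, `assign_encoder_metadata`) are simplified stand-ins.

```python
# Hypothetical sketch; simplified stand-ins for the vLLM internals touched by this PR.
from dataclasses import dataclass
from enum import Enum, auto

import torch


class AttentionType(Enum):
    # Only the variants relevant to this sketch.
    DECODER = auto()
    ENCODER_ONLY = auto()


@dataclass
class EncoderOnlyAttentionMetadata:
    """Metadata for non-causal self-attention over full sequences (no KV cache)."""
    num_actual_tokens: int
    max_seq_len: int
    # Cumulative sequence lengths, as used by varlen attention kernels.
    cu_seqlens: torch.Tensor


def build_encoder_attn_metadata(seq_lens: list[int]) -> EncoderOnlyAttentionMetadata:
    """Build attention metadata directly from the scheduled sequences."""
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(seq_lens, dtype=torch.int32), dim=0)
    return EncoderOnlyAttentionMetadata(
        num_actual_tokens=sum(seq_lens),
        max_seq_len=max(seq_lens),
        cu_seqlens=cu_seqlens,
    )


def assign_encoder_metadata(attention_layers: dict, seq_lens: list[int]) -> dict:
    """Share one metadata object across all encoder-only attention layers."""
    encoder_attn_metadata = build_encoder_attn_metadata(seq_lens)
    attn_metadata = {}
    for layer_name, attn_module in attention_layers.items():
        if attn_module.attn_type == AttentionType.ENCODER_ONLY:
            attn_metadata[layer_name] = encoder_attn_metadata
    return attn_metadata
```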

Related to:

  • V0 deprecation: #18571
  • 2025 Q3 roadmap: #20336

This PR is co-authored with @russellb. It borrows all of the encoder-only attention code from his PR #21088 but leaves out the cross-encoder and encoder attention.

cc: @DarkLight1337


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the v1 label on Jul 20, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for encoder-only models without a KV-cache. The changes are well-structured and cover the necessary modifications in the attention backend, scheduler, and model runner. I have identified areas where the implementation's strictness could limit future extensibility. Specifically, the error handling and assertions in GPUModelRunner are too restrictive and should be made more flexible to accommodate potential future model architectures.

@maxdebayser (Contributor, Author) commented Jul 21, 2025

@DarkLight1337 this PR should enable support for all BERT models except for the classifier models that require token type IDs. That can be left for a future PR, as there are several implementation alternatives. Since the KV cache is disabled in this PR, it requires far fewer changes than PR #19988.

@DarkLight1337 (Member) commented Jul 21, 2025

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jul 21, 2025
mergify bot removed the needs-rebase label on Jul 21, 2025
russellb dismissed their stale review on July 22, 2025 at 01:03

my comments were addressed, but it needs review from others since I'm a co-author:

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

russellb added this to the v0.10.0 milestone on Jul 22, 2025
A reviewer (Member) commented on this snippet:

```python
    self.vllm_config, Attention)
for layer_name, attn_module in attention_layers.items():
    if attn_module.attn_type == AttentionType.ENCODER_ONLY:
        attn_metadata[layer_name] = encoder_attn_metdata
```

Suggested change:

```diff
- attn_metadata[layer_name] = encoder_attn_metdata
+ attn_metadata[layer_name] = encoder_attn_metadata
```
