
Support encoder-only models without KV-Cache #21270


Open
wants to merge 9 commits into base: main

Conversation

@maxdebayser (Contributor) commented Jul 20, 2025

Add support for encoder models such as BERT, which cannot use a KV cache because of their non-causal attention. Since the KV Cache Spec is used to build the attention metadata for decoder models, this PR initializes the attention metadata builders for encoder-only models directly from the layers and adds a function to build the attention metadata.

This PR combines elements of PRs #21088 and #19988.

Summary of changes:

Flash Attention Backend:

  • Implement encoder self-attention support without using KV cache

Scheduler:

  • Disable chunked prefill for models without KV cache

GPU Model Runner:

  • Implement encoder-only attention metadata building for self-attention (a simplified sketch follows this list)
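
As a rough illustration of the approach (not the actual vLLM code), the sketch below shows the core idea: metadata for encoder-only, non-causal self-attention is built directly from the scheduled sequence lengths rather than from the KV Cache Spec, and the same metadata object is assigned to every encoder-only attention layer. All names here (`AttentionType`, `EncoderOnlyAttentionMetadata`, `build_encoder_attn_metadata`, `assign_encoder_metadata`) are simplified stand-ins.

```python
# Hypothetical sketch; simplified stand-ins for the vLLM internals touched by this PR.
from dataclasses import dataclass
from enum import Enum, auto

import torch


class AttentionType(Enum):
    # Only the variants relevant to this sketch.
    DECODER = auto()
    ENCODER_ONLY = auto()


@dataclass
class EncoderOnlyAttentionMetadata:
    """Metadata for non-causal self-attention over full sequences (no KV cache)."""
    num_actual_tokens: int
    max_seq_len: int
    # Cumulative sequence lengths, as used by varlen attention kernels.
    cu_seqlens: torch.Tensor


def build_encoder_attn_metadata(seq_lens: list[int]) -> EncoderOnlyAttentionMetadata:
    """Build attention metadata directly from the scheduled sequences."""
    cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(torch.tensor(seq_lens, dtype=torch.int32), dim=0)
    return EncoderOnlyAttentionMetadata(
        num_actual_tokens=sum(seq_lens),
        max_seq_len=max(seq_lens),
        cu_seqlens=cu_seqlens,
    )


def assign_encoder_metadata(attention_layers: dict, seq_lens: list[int]) -> dict:
    """Share one metadata object across all encoder-only attention layers."""
    encoder_attn_metadata = build_encoder_attn_metadata(seq_lens)
    attn_metadata = {}
    for layer_name, attn_module in attention_layers.items():
        if attn_module.attn_type == AttentionType.ENCODER_ONLY:
            attn_metadata[layer_name] = encoder_attn_metadata
    return attn_metadata
```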

Related to:

  • V0 deprecation: #18571
  • 2025 Q3 roadmap: #20336

This PR is co-authored with @russellb. It borrows all of the encoder-only attention code from his PR #21088 but leaves out the cross-encoder and encoder attention.

cc: @DarkLight1337


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the v1 label on Jul 20, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for encoder-only models without a KV-cache. The changes are well-structured and cover the necessary modifications in the attention backend, scheduler, and model runner. I have identified areas where the implementation's strictness could limit future extensibility. Specifically, the error handling and assertions in GPUModelRunner are too restrictive and should be made more flexible to accommodate potential future model architectures.

@maxdebayser (Contributor, Author) commented Jul 21, 2025

@DarkLight1337 this PR should enable support for all BERT models except for the classifier models that require token type IDs. That can be left for a future PR, as there are several implementation alternatives. Since the KV cache is disabled in this PR, it requires far fewer changes than PR #19988.

@DarkLight1337 (Member) commented Jul 21, 2025

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jul 21, 2025
mergify bot removed the needs-rebase label on Jul 21, 2025
russellb dismissed their stale review on July 22, 2025 at 01:03

my comments were addressed, but it needs review from others since I'm a co-author:

cc @WoosukKwon @LucasWilkinson it would be best for you two to review this to ensure that the refactoring fits your design.

russellb added this to the v0.10.0 milestone on Jul 22, 2025
A reviewer (Member) commented on this snippet:

```python
    self.vllm_config, Attention)
for layer_name, attn_module in attention_layers.items():
    if attn_module.attn_type == AttentionType.ENCODER_ONLY:
        attn_metadata[layer_name] = encoder_attn_metdata
```

Suggested change:

```diff
- attn_metadata[layer_name] = encoder_attn_metdata
+ attn_metadata[layer_name] = encoder_attn_metadata
```
