
Add support for encoder embedding models #19988


Open
wants to merge 22 commits into base: main

Conversation

@maxdebayser (Contributor) commented Jun 23, 2025

This PR is a follow-up to #16188. It adds support for encoder models, which don't have a causal attention mask, as well as for models such as cross-encoder/ms-marco-MiniLM-L-6-v2 that require token_type_ids to be passed from the tokenizer to the model.

I've also opened a PR with an alternative implementation for the token_type_ids: #20026
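
For illustration, a minimal usage sketch of the two kinds of models this PR targets. The embedding model name, the task flags and the LLM.embed()/LLM.score() calls are assumptions about a typical setup, not part of this PR, and exact flags may differ depending on your vLLM version:

from vllm import LLM

# Encoder embedding model (bidirectional attention, no causal mask).
embedder = LLM(model="BAAI/bge-base-en-v1.5", task="embed")
vectors = embedder.embed(["vLLM V1 can now serve encoder models."])

# Cross-encoder that needs token_type_ids forwarded from the tokenizer.
scorer = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", task="score")
scores = scorer.score("what is vllm?", ["vLLM is an inference engine for LLMs."])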

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added v1 tpu Related to Google TPUs labels Jun 23, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @maxdebayser, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's capability by introducing robust support for encoder embedding models and the necessary token_type_ids they often require. I've adapted the attention mechanisms, KV cache management, and core model logic to seamlessly integrate these new model types, ensuring broader compatibility and efficient execution.

Highlights

  • Encoder Model Support: I've added comprehensive support for encoder embedding models, which do not utilize a causal attention mask. This expands the range of models that can be efficiently served by the system.
  • Token Type ID Handling: Full end-to-end support for token_type_ids has been implemented. This is crucial for models like cross-encoder/ms-marco-MiniLM-L-6-v2 that require these IDs to be passed from the tokenizer to the model's forward pass.
  • FlashAttention Integration: The FlashAttention backend (FlashAttentionImpl) now supports AttentionType.ENCODER_ONLY models by dynamically determining causality based on the attention type (see the sketch after this list). This allows encoder models to leverage the performance benefits of FlashAttention.
  • KV Cache Adaptations: The KV cache mechanism has been updated to support encoder-only models for prefix caching. Additionally, I've introduced logic to automatically disable features like prefix caching and chunked prefill when non-decoder (e.g., encoder) attention types are detected, as these optimizations are not applicable or can cause issues with non-causal models.
  • Model Compatibility & Refinements: I've removed the SupportsV0Only interface from BertEmbeddingModel, BertForSequenceClassification, and RobertaForSequenceClassification, signifying their full compatibility with the V1 engine. The RoBERTa embedding layer's position ID handling has also been refined to correctly manage its specific requirements.
  • Temporary CUDA Graph Limitation: For pooling operations with encoder models, CUDA graphs are temporarily disabled and eager mode is enforced due to current compatibility limitations.
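
As a rough illustration of the causality point in the FlashAttention highlight above, here is a minimal sketch. The string constants are stand-ins for vLLM's real AttentionType values; only the derivation of the causal flag is shown.

# Stand-ins for vLLM's AttentionType constants; the real values live inside vLLM.
DECODER = "decoder"
ENCODER_ONLY = "encoder_only"

def is_causal(attn_type: str) -> bool:
    # Decoder self-attention applies a causal mask; encoder-only attention is
    # bidirectional, so the kernel is invoked with causal=False.
    return attn_type == DECODER

assert is_causal(DECODER) and not is_causal(ENCODER_ONLY)
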
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds support for encoder embedding models and models that require token_type_ids to be passed from the tokenizer to the model. The changes include modifications to the test suite, configuration files, and model execution logic to accommodate these new model types. The code review identified some areas for improvement, including adding more descriptive error messages, making code conditional on model architecture, and simplifying logic.

  pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct",
-              marks=[pytest.mark.skip_v1]),
+              marks=[pytest.mark.skip_v0]),
Contributor

Severity: medium

Consider removing the skip mark for v0 tests, as it seems the models are now supported in both engines.

vllm/config.py Outdated
Comment on lines 720 to 722
            logger.warning("CUDA graph is not supported for pooling yet, "
                           "fallback to the eager mode.")
            self.enforce_eager = True
Contributor

Severity: medium

This warning message and the subsequent setting of self.enforce_eager could be made conditional on the specific model architecture or a more general flag that indicates whether the model is an encoder model. This would avoid unnecessary warnings and eager mode enforcement for models that might not have this issue.

if self.runner_type == "pooling":
            # WIP: currently cuda graphs are not working for encoder models.
            logger.warning("CUDA graph is not supported for pooling yet, "
                           "fallback to the eager mode.")
            self.enforce_eager = True
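
A self-contained sketch of the conditional fallback the reviewer is suggesting above; uses_bidirectional_attention is a hypothetical flag used here for illustration, not an existing vLLM config field:

import logging

logger = logging.getLogger(__name__)

def resolve_enforce_eager(runner_type: str,
                          uses_bidirectional_attention: bool,
                          enforce_eager: bool) -> bool:
    # Only force eager mode for pooling models that actually hit the CUDA
    # graph limitation, i.e. encoder-style models with bidirectional attention.
    if runner_type == "pooling" and uses_bidirectional_attention:
        logger.warning("CUDA graph is not supported for encoder pooling "
                       "models yet, falling back to eager mode.")
        return True
    return enforce_eager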

Comment on lines +389 to 394
        if attn_type not in [
                AttentionType.DECODER, AttentionType.ENCODER_ONLY
        ]:
            raise NotImplementedError("Encoder/decoder cross-attention "
                                      "is not implemented for "
                                      "FlashAttentionImpl")
Contributor

Severity: medium

The NotImplementedError message is not very descriptive. It should indicate what types of attention are not implemented, and what attention types are supported.

        if attn_type not in [
                AttentionType.DECODER, AttentionType.ENCODER_ONLY
        ]:
            raise NotImplementedError("FlashAttentionImpl only supports DECODER and ENCODER_ONLY attention types.")

Comment on lines +155 to +168
        if isinstance(spec, AttentionSpec) and \
                spec.attn_type != "decoder":

            logger.info("Found non-decoder layer. Disabling "
                        "prefix cache and chunked prefill")
            self.vllm_config.cache_config.\
                enable_prefix_caching = False
            self.vllm_config.scheduler_config.\
                enable_chunked_prefill = False
            self.vllm_config.scheduler_config.\
                chunked_prefill_enabled = False
            self.vllm_config.scheduler_config.\
                long_prefill_token_threshold = 0
            break
Contributor

Severity: medium

Consider adding a check to ensure that prefix caching and chunked prefill are not enabled when a non-decoder layer is found. This would provide a more explicit error message to the user, rather than silently disabling the features.

Suggested change
-        if isinstance(spec, AttentionSpec) and \
-                spec.attn_type != "decoder":
-            logger.info("Found non-decoder layer. Disabling "
-                        "prefix cache and chunked prefill")
-            self.vllm_config.cache_config.\
-                enable_prefix_caching = False
-            self.vllm_config.scheduler_config.\
-                enable_chunked_prefill = False
-            self.vllm_config.scheduler_config.\
-                chunked_prefill_enabled = False
-            self.vllm_config.scheduler_config.\
-                long_prefill_token_threshold = 0
-            break
+        for kv_cache_spec_one_worker in kv_cache_specs:
+            for _, spec in kv_cache_spec_one_worker.items():
+                if isinstance(spec, AttentionSpec) and \
+                        spec.attn_type != "decoder":
+                    if self.vllm_config.cache_config.enable_prefix_caching or \
+                            self.vllm_config.scheduler_config.enable_chunked_prefill:
+                        raise ValueError("Prefix caching and chunked prefill are "
+                                         "not supported for non-decoder models.")

Comment on lines 635 to +644
            0,
            torch.from_numpy(token_indices),
            out=self.input_ids_cpu[:total_num_scheduled_tokens])
        if self.input_batch.token_type_ids_cpu_tensor is not None:
            token_type_ids = torch.index_select(
                self.input_batch.token_type_ids_cpu_tensor.flatten(), 0,
                torch.from_numpy(token_indices))
            # Copy the tensors to the GPU.
            self.get_token_type_ids()[:total_num_scheduled_tokens]\
                .copy_(token_type_ids, non_blocking=True)
Contributor

Severity: medium

Consider adding a check to ensure that self.input_batch.token_type_ids_cpu_tensor is not None before attempting to copy the token type IDs to the GPU. This would prevent a potential error if the model does not use token type IDs.

        if self.input_batch.token_type_ids_cpu_tensor is not None:
            token_type_ids = torch.index_select(
                self.input_batch.token_type_ids_cpu_tensor.flatten(), 0,
                torch.from_numpy(token_indices))
            # Copy the tensors to the GPU.
            self.get_token_type_ids()[:total_num_scheduled_tokens]\
                .copy_(token_type_ids, non_blocking=True)

Comment on lines 1317 to 1327
        has_token_types = self.token_type_ids is not None
        model_kwargs = {}

        if self.is_multimodal_model and get_pp_group().is_first_rank:
            # NOTE(woosuk): To unify token ids and soft tokens (vision
            # embeddings), we always use embeddings (rather than token ids)
            # as input to the multimodal model, even when the input is text.
            input_ids = self.input_ids[:num_scheduled_tokens]
            if has_token_types:
                model_kwargs["token_type_ids"] = cast(
                    torch.Tensor, self.token_type_ids)[:num_scheduled_tokens]
Contributor

Severity: medium

Consider adding a check to ensure that has_token_types is True before attempting to access self.token_type_ids. This would prevent a potential error if the model does not use token type IDs.

            input_ids = self.input_ids[:num_scheduled_tokens]
            if has_token_types:
                model_kwargs["token_type_ids"] = cast(
                    torch.Tensor, self.token_type_ids)[:num_scheduled_tokens]
            if mm_embeds:

The @supports_torch_compile decorator was applied only to the
encoder stack, leaving the embeddings out, which resulted in
a dependency on dynamically allocated tensors.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@@ -2570,25 +2626,28 @@ def get_kv_cache_spec(self) -> dict[str, KVCacheSpec]:
                continue

            # TODO: Support other attention modules, e.g., cross-attention
            if attn_module.attn_type == AttentionType.DECODER:
                # encoder only can also benefit from KV cache for prefix caching
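
A hedged sketch of what the extended check might look like, inferred from the inline comment above; the string constants are stand-ins for vLLM's AttentionType values and the surrounding loop over attention modules is omitted:

DECODER = "decoder"
ENCODER_ONLY = "encoder_only"

def needs_kv_cache_spec(attn_type: str) -> bool:
    # Decoder layers need KV cache for autoregressive decoding. Encoder-only
    # layers don't strictly need it, but keeping a spec for them lets prefix
    # caching work as long as prompts are never chunked.
    return attn_type in (DECODER, ENCODER_ONLY)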
Member

My understanding was that attention in an encoder is (typically?) bidirectional, so the encoder has to process the entire sequence at once. If that's the case, it doesn't seem like KV cache makes sense here.

I'm hitting this in my work on encoder-decoder support, since the Attention abstraction in V1 assumes KV cache is used for attention, though it's not relevant for the encoder. I don't have it fully sorted out yet.

If I'm missing why it's needed here, could you help me understand? Thanks!

Contributor Author

Yes, you're right, it's not necessary. However, the last time I looked, a lot of places in the scheduler, KV cache manager, model runner and others assume the existence of a KV cache. So since the KV cache doesn't prevent the encoder from working as long as there is no chunking, I took the pragmatic approach of not disabling it.
But I'll check what progress has been made on that with the support for attention-free models in V1.

Member

I see. When I had KV cache enabled, I was getting different results out of the encoder between V0 and V1. It's very possible the cause was something other than having KV cache enabled, though.

Just to check though, have you validated that the outputs are consistent between V0 and V1?

Contributor Author

Yes, I've enabled the tests in tests/models/language/pooling to run with both engines, as well as the entrypoints tests. They compare against transformers or sentence-transformers results.
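
For illustration, a rough sketch of the kind of cross-check those tests perform; the model name, API calls and tolerance below are assumptions for this sketch, not the actual test code in tests/models/language/pooling:

import numpy as np
from sentence_transformers import SentenceTransformer
from vllm import LLM

MODEL = "BAAI/bge-base-en-v1.5"
prompts = ["vLLM V1 now supports encoder embedding models."]

# Reference embeddings from sentence-transformers.
ref = SentenceTransformer(MODEL).encode(prompts, normalize_embeddings=True)

# Embeddings from vLLM, normalized so cosine similarity is a plain dot product.
outputs = LLM(model=MODEL, task="embed").embed(prompts)
got = np.array([o.outputs.embedding for o in outputs])
got /= np.linalg.norm(got, axis=1, keepdims=True)

# The two stacks should agree closely if the V1 path is correct.
assert float((ref * got).sum(axis=1).min()) > 0.99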

Member

great! I'm sure my problem was something else then.

FWIW, I did just get encoder-without-kv-cache working in my branch here: main...russellb:vllm:v1-whisper

The branch is still WIP, but I was able to verify that the output of the encoder matches between V0 and V1 (at least with whisper)

Contributor Author

Cool, I'll try it out. I'm especially keen to see if it improves the performance of encoder models to run without kv-cache.

Member

Sounds good. I haven't measured performance yet. Let me know whatever you find if you try it. I may not have ENCODER_ONLY in all the right spots since I haven't tested that case.


mergify bot commented Jul 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 2, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@maxdebayser (Contributor Author)

@DarkLight1337, can you enable the full CI tests here to see if something is broken while we wait for reviews? Same for #20026

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 8, 2025

mergify bot commented Jul 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 8, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@mergify mergify bot removed the needs-rebase label Jul 8, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

mergify bot commented Jul 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 14, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@mergify mergify bot removed the needs-rebase label Jul 14, 2025

mergify bot commented Jul 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 16, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@mergify mergify bot removed the needs-rebase label Jul 16, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

mergify bot commented Jul 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maxdebayser.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 16, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@mergify mergify bot removed the needs-rebase label Jul 16, 2025
Labels
ready (ONLY add when PR is ready to merge/full CI is needed) · tpu (Related to Google TPUs) · v1
3 participants