
[Core][Model] PrithviMAE Enablement on vLLM v1 engine (with zero kv_cache_groups) #20577


Open · christian-pinto wants to merge 11 commits into main from prithvi_v1_embeddings_zero_kv_cache_group

Conversation

@christian-pinto (Contributor) commented on Jul 7, 2025

This PR is intended for comparison with #20072, where support for attention-free models was implemented by overloading the KVCacheManager.

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
@christian-pinto changed the title to [Core][Model] PrithviMAE Enablement on vLLM v1 engine (with zero kv_cache_groups) on Jul 7, 2025
The mergify bot added the documentation, multi-modality (#4194), and v1 labels on Jul 7, 2025.
@gemini-code-assist (bot) left a comment
Summary of Changes

Hello @christian-pinto, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances vLLM's capability to support a broader range of models, specifically focusing on attention-free architectures like PrithviMAE. It introduces a streamlined approach for handling models that do not require a KV cache and enables direct processing of raw multimodal inputs, moving away from previous workarounds and integrating these functionalities more natively into the vLLM v1 engine.

Highlights

  • Attention-Free Model Support: Introduced comprehensive support for attention-free models, such as PrithviMAE, by allowing the vLLM engine to operate with zero KV cache groups. This involves modifications across the KV cache management, scheduler, and engine initialization to bypass KV cache allocation and related operations for these specific models.
  • Raw Multimodal Input Handling: Implemented a new mechanism for multimodal models to directly process raw input data (e.g., pixel values) without requiring tokenization or embedding lookups. This includes new interfaces (SupportsMultiModalWithRawInput), configuration flags, and adjustments to the model runner to pass raw data directly to the model's forward pass (a minimal illustrative sketch follows this list).
  • PrithviMAE Integration: Integrated the PrithviMAE model as a concrete example of an attention-free, raw multimodal input model. This involved updating its class inheritance, adjusting its multimodal data processing, and adding a dedicated test case to validate its functionality within the vLLM framework.
  • Conditional Tokenizer Initialization: Modified the engine and processor to conditionally initialize and use the tokenizer, allowing models that do not rely on text tokenization (like PrithviMAE) to skip this step, improving efficiency and flexibility.
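For intuition on the raw-input path mentioned above, here is a minimal sketch of what a marker mixin and the corresponding model-runner check could look like. This is illustrative only: the attribute name, the helper function, and the mixin body are assumptions, not the PR's actual definition of SupportsMultiModalWithRawInput.

```python
from typing import ClassVar


class SupportsMultiModalWithRawInput:
    """Marker mixin for models that consume raw multimodal tensors (e.g. pixel
    values) directly in forward(), skipping tokenization and embedding lookup."""
    supports_multimodal_raw_input: ClassVar[bool] = True


def wants_raw_mm_input(model: object) -> bool:
    # The model runner could branch on the marker roughly like this.
    return getattr(model, "supports_multimodal_raw_input", False)
```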

mergify bot commented on Jul 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @christian-pinto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist (bot) left a comment

Code Review

The pull request introduces changes to enable PrithviMAE on the vLLM v1 engine with zero kv_cache_groups. The changes include modifications to several files, including adding a new test file, updating the configuration, modifying the model executor, and updating the scheduler. The code changes appear to be well-structured and address the objective of the pull request. There are a few suggestions to improve the code's robustness and readability.

Comment on lines +146 to +147
dtype="float16",
enforce_eager=True,
Severity: medium

Consider using torch.float16 instead of the string literal for specifying the dtype. This approach is more type-safe and avoids potential errors due to typos or inconsistencies.

Suggested change:
-        dtype="float16",
-        enforce_eager=True,
+        dtype=torch.float16,
+        enforce_eager=True,

@@ -1120,10 +1125,10 @@ def get_sliding_window(self) -> Optional[Union[int, list[Optional[int]]]]:
return self.get_hf_config_sliding_window()

def get_vocab_size(self) -> int:
Severity: medium

Consider providing a default value when using getattr to avoid potential AttributeError if hf_text_config does not have the vocab_size attribute. This makes the code more robust.

Suggested change:
    def get_vocab_size(self) -> int:
        return getattr(self.hf_text_config, "vocab_size", 0)

Comment on lines 1130 to +1131
     def get_hidden_size(self) -> int:
-        return self.hf_text_config.hidden_size
+        return getattr(self.hf_text_config, "hidden_size", 0)
Severity: medium

Consider providing a default value when using getattr to avoid potential AttributeError if hf_text_config does not have the hidden_size attribute. This makes the code more robust.

Suggested change:
    def get_hidden_size(self) -> int:
        return getattr(self.hf_text_config, "hidden_size", 0)

@@ -169,7 +184,7 @@ def _parse_and_validate_multimodal_data(
         if not isinstance(pixel_values, torch.Tensor):
             raise ValueError(f"Incorrect type of pixel_values. "
                              f"Got type: {type(pixel_values)}")
-        pixel_values = torch.unbind(pixel_values, dim=0)[0]
+        # pixel_values = torch.unbind(pixel_values, dim=0)[0]
Severity: medium

This line is commented out. If it's no longer needed, consider removing it to reduce code clutter and improve readability. If it's temporarily disabled for debugging, add a comment explaining why and when it should be re-enabled.

Comment on lines +1059 to 1067 (the PR wraps the existing sanity check and append loop in an if curr_group_outputs: guard):

        if curr_group_outputs:
            sanity_check_mm_encoder_outputs(
                curr_group_outputs,
                expected_num_items=len(grouped_mm_inputs),
            )

            for output in curr_group_outputs:
                encoder_outputs.append(output)

Severity: medium

The code checks whether curr_group_outputs is non-empty before running the sanity checks and appending the outputs. However, there is no else branch, which might indicate a missing handling case. If curr_group_outputs is empty, it might be necessary to handle this scenario or log a warning to indicate that no outputs were generated.

Suggested change:
        if curr_group_outputs:
            sanity_check_mm_encoder_outputs(
                curr_group_outputs,
                expected_num_items=len(grouped_mm_inputs),
            )
            for output in curr_group_outputs:
                encoder_outputs.append(output)
        else:
            logger.warning("No encoder outputs generated for this group.")

github-actions bot commented on Jul 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@heheda12345 (Collaborator) left a comment

I checked the code in block_pool.py, kv_cache_coordinator.py, kv_cache_manager.py, kv_cache_utils.py, core.py, and gpu_input_batch.py, and I feel that it is possible to implement the 0 kv cache config cleanly. I'm not sure whether the changes in scheduler.py are required.

It's quite hard for me to find all the code related to the kv cache manager. Can you put it into a separate PR if you want to continue with the 0 kv cache group option?

# needed for this request.
if self.connector is not None:
self.connector.update_state_after_alloc(
if not self.vllm_config.model_config.is_attention_free:
Collaborator:

Why don't we need these lines in #20072 ?

Author:

Just a wild indentation!

zip(kv_cache_specs, available_gpu_memory)
]
#TODO: CP start from here
if vllm_config.model_config.is_attention_free:
Collaborator:

I prefer to handle attention-free models here, in get_kv_cache_config(). It would be great if this can be achieved by another branch in addition to is_kv_cache_type_uniform and is_kv_cache_page_size_uniform; please tell me if further modifications are needed.
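For illustration, a rough sketch of the kind of extra branch being suggested. KVCacheConfig below is a stand-in dataclass, not vLLM's real class, and where exactly this branch would sit inside kv_cache_utils.get_kv_cache_config() is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class KVCacheConfig:
    # Stand-in for vLLM's KV cache config; field names are illustrative.
    num_blocks: int
    kv_cache_groups: list = field(default_factory=list)


def get_kv_cache_config(kv_cache_spec: dict) -> KVCacheConfig:
    if not kv_cache_spec:
        # Attention-free model: no layer registered a KV cache spec, so keep a
        # single placeholder block and zero kv_cache_groups. The scheduler then
        # has nothing to allocate and prefix caching is effectively disabled.
        return KVCacheConfig(num_blocks=1, kv_cache_groups=[])
    raise NotImplementedError(
        "uniform-type / uniform-page-size branches elided in this sketch")
```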

@@ -295,7 +295,9 @@ def add_request(
self.num_tokens_no_spec[req_index] = request.num_tokens

self.num_computed_tokens_cpu[req_index] = request.num_computed_tokens
self.block_table.add_row(request.block_ids, req_index)
Collaborator:

I think this line does nothing if there are 0 kv cache groups.

@@ -36,7 +36,8 @@ def __init__(
enable_caching: bool,
enable_kv_cache_events: bool = False,
):
assert isinstance(num_gpu_blocks, int) and num_gpu_blocks > 0
# num_gpu_blocks can be 0 for attention free models
assert isinstance(num_gpu_blocks, int)
Collaborator:

I think we can always have at least 1 gpu block so that we don't need to handle the null block.
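A minimal sketch of that idea; the helper name is illustrative and where the clamp would actually live in vLLM is an assumption.

```python
def clamp_num_gpu_blocks(num_gpu_blocks: int) -> int:
    # Attention-free models would otherwise compute 0 blocks; keeping one
    # placeholder block means the free-block list and null-block handling
    # need no special cases.
    return max(num_gpu_blocks, 1)
```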

blocks[i].prev_free_block = blocks[i - 1]
if i < self.num_free_blocks - 1:
blocks[i].next_free_block = blocks[i + 1]
# This is 0 in attention free models
Collaborator:

These modifications are not needed if there is at least 1 gpu block.

assert num_gpu_blocks is not None and num_gpu_blocks > 0

# num_gpu_blocks can be zero for attention free models
assert num_gpu_blocks is not None
Collaborator:

Not needed if there is at least 1 gpu block.

@@ -246,6 +248,12 @@ def schedule(self) -> SchedulerOutput:
request.num_tokens, 0)

while True:
# This model is attention free and we do not need to allocate KVCache blocks
# for serving requests.
if self.vllm_config.model_config.is_attention_free:
Collaborator:

Why do you need this change? I think allocate_slots should always succeed.

self.verify_and_split_kv_cache_groups()
# attention free models are initialized with 0 kv_cache_groups
if len(self.kv_cache_config.kv_cache_groups) > 0:
self.verify_and_split_kv_cache_groups()
Collaborator:

I'm comfortable with adding another coordinator for 0 kv cache groups and re-implementing find_longest_cache_hit for it.

Collaborator:

As I've observed more and more cases that the current find_longest_cache_hit can't handle, I'm suggesting a new KVCacheCoordinatorNoPrefixCache to be used when prefix caching is disabled. Can you sync with the author of #20661 to avoid duplicating work?
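A rough sketch of the suggested coordinator; the method signature and return convention are assumptions and this is not the actual code of this PR or #20661.

```python
class KVCacheCoordinatorNoPrefixCache:
    """Coordinator for when prefix caching is disabled, including the
    zero-kv_cache_groups case: every lookup is a cache miss."""

    def find_longest_cache_hit(self, block_hashes: list, max_length: int):
        # No prefix cache, hence no cached blocks and zero computed tokens.
        return [], 0
```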

Contributor:

@christian-pinto I have introduced a KVCacheCoordinatorNoPrefixCache in this PR (#20661 ). I think it should handle your case as well. Could you give it a try?

Author:

Hey @nopperl thanks for that. Your approach solves my issue too.

@@ -327,6 +329,11 @@ def _may_reorder_batch(self, scheduler_output: "SchedulerOutput") -> None:
Args:
scheduler_output: The scheduler output.
"""

# nothing to be reordered when the model is attention free
if self.model_config.is_attention_free:
Collaborator:

I'm OK with this change temporarily. I'll refactor this function soon to handle both the attention-free case and many other unsupported cases.

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
- Improved formatting around
- made is_pooling_model a @property in ModelConfig

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
- Remove unused functions
- merged functions not called anywhere else

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
… manager.

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
@christian-pinto force-pushed the prithvi_v1_embeddings_zero_kv_cache_group branch from e0dd56a to 645d061 on July 9, 2025 at 11:08
@christian-pinto (Author) commented on Jul 9, 2025

@heheda12345 I have followed your suggestions and instantiated the kv_cache config with 1 block, and most of the changes I initially made are no longer needed. Many thanks!

Also, I have implemented the handling of attention-free models in kv_cache_utils.get_kv_cache_config() as you suggested.

Please have a look at the last commit (645d061) to see all the relevant changes.

If zero kv_cache_groups is the preferred approach compared to overloading the KVCacheManager, please let me know and I will move it to a separate branch and open a PR. I am keeping it here for the time being as it is easier for me to test.

The mergify bot added the new-model (Requests to new models) label on Jul 10, 2025.
@heheda12345 (Collaborator) left a comment

Yeah, I like the current implementation of 0 kv cache groups. Can you make a new PR for that?


output = self.collective_rpc("determine_available_memory")
return output

def get_kv_cache_specs(self) -> list[dict[str, KVCacheSpec]]:
if self.vllm_config.model_config.is_attention_free:
return [{"attention_free": KVCacheSpec(block_size=0)}]
Collaborator:

Can you just return an empty dict?
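For concreteness, a standalone sketch of the reviewer's suggestion (the signature is simplified here, and whether downstream callers tolerate an empty spec dict is an assumption):

```python
def get_kv_cache_specs(is_attention_free: bool) -> list[dict]:
    # Sketch of the suggestion: an attention-free model reports an empty spec
    # dict instead of a placeholder "attention_free" entry.
    if is_attention_free:
        return [{}]
    raise NotImplementedError("per-worker spec collection elided in this sketch")
```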

Labels: documentation, multi-modality (#4194), needs-rebase, new-model, v1
Projects: None yet
3 participants