[Core][Model] PrithviMAE Enablement on vLLM v1 engine #20072
Conversation
Summary of Changes
Hello @christian-pinto, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly extends vLLM's capabilities by enabling the Prithvi MAE Geospatial model on the V1 engine. It introduces foundational changes to support multi-modal models that require raw input data, streamlines the handling of attention-less models by providing a dummy KV cache mechanism, and refines the engine's behavior for models that do not require tokenizer initialization.
Highlights
- Multi-modal Support: Added support for multi-modal models that consume raw input data (e.g., image tensors) directly, rather than pre-computed embeddings. This includes new interfaces (SupportsMultiModalWithRawInput) and integration points in the model runner to pass raw data to the model's forward pass (a rough sketch follows below).
- Attention-Free Models: Introduced a DummyKVCacheManager and updated the scheduler and engine initialization to properly handle models that do not require a KV cache (attention-free models). This prevents unnecessary resource allocation and potential hangs by bypassing KV cache operations for such models.
- Tokenizer Initialization: Enhanced support for models that specify skip_tokenizer_init=True, ensuring that tokenizer-related operations are conditionally bypassed in various parts of the engine (e.g., LLMEngine, Processor, OutputProcessor, StructuredOutput), improving flexibility for models that don't rely on a traditional tokenizer.
- PrithviMAE Integration: Integrated the Prithvi MAE Geospatial model into the vLLM V1 engine, including updates to its model definition, multi-modal data processing (e.g., using MultiModalFieldConfig.shared and adjusting dummy data types), and an example inference script.
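For orientation, here is a minimal sketch of how a model might opt into the raw-input path. The interface name comes from this PR, but its internals, the supports_multimodal_raw_input attribute, and the pixel_values keyword are illustrative assumptions, not the PR's exact API:

```python
import torch
from torch import nn


class SupportsMultiModalWithRawInput:
    """Marker interface (sketch): the runner forwards raw multi-modal
    tensors straight to forward() instead of computing embeddings."""
    supports_multimodal_raw_input: bool = True  # assumed flag name


class PrithviMAELike(nn.Module, SupportsMultiModalWithRawInput):
    """Illustrative model skeleton, not the PR's actual implementation."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids: torch.Tensor, *, pixel_values: torch.Tensor,
                **kwargs) -> torch.Tensor:
        # The raw image tensor arrives untouched from the model runner;
        # input_ids is just the single dummy token used as the "prompt".
        return self.encoder(pixel_values)
```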
Code Review
This pull request enables the Prithvi MAE Geospatial model on the vLLM V1 engine by adding support for multi-modal models that require raw multi-modal data, improving support for models that use skip_tokenizer_init=True, and enhancing support for attention-free models. The changes include modifications to the PrithviMAE model, example usage, configuration, and core components of vLLM. The changes look good, but I've added some suggestions for improved comments.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from ca85b88 to c7f80c0
vllm/config.py (Outdated)
        self.is_pooling_model = self.registry.is_pooling_model(
            self.architectures)
Make this a @property?
sure, no problem
@@ -64,7 +65,130 @@ def new_empty(self) -> "KVCacheBlocks":
         return KVCacheBlocks(tuple([] for _ in range(len(self.blocks))))

-class KVCacheManager:
+class KVCacheManagerInterface(ABC):
Not sure whether this conforms with V1 design. cc @maxdebayser @heheda12345 @WoosukKwon
I have used the same interface template used for the Scheduler in v1. If another way of defining an interface is the preferred one, please let me know.
@DarkLight1337 , by V1 design do you mean "the right way of defining interfaces" or whether the KVCacheManager should be pluggable or not?
BTW, is there anyone working on pure SSM model support in V1?
> @DarkLight1337, by V1 design do you mean "the right way of defining interfaces" or whether the KVCacheManager should be pluggable or not?

Yeah, I'm not sure which parts of KVCacheManager should be customizable.

> BTW, is there anyone working on pure SSM model support in V1?
I have investigated option 1 and it would require a number of small changes in the KVCacheManager, KVCacheCoordinator, etc. The logic relies on the fact that there is at least one kv_cache_group, and I am not sure we would end up with a cleaner implementation. By overriding instead, we simply drop all the logic that would otherwise be useless when a KV cache is not used.
These are my two cents, but I am happy to go with whichever solution is more in line with the overall strategy for the KVCacheManager.
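To make the override approach concrete, here is a minimal, self-contained sketch of a no-op manager. get_computed_blocks, allocate_blocks, and get_block_ids are method names that appear elsewhere in this thread; the free method, the exact signatures, and the KVCacheBlocks stand-in are assumptions rather than the PR's actual code:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class KVCacheBlocks:
    """Minimal stand-in for vLLM's KVCacheBlocks container."""
    blocks: list = field(default_factory=list)


class KVCacheManagerInterface(ABC):
    """Assumed shape of the interface; the PR's signatures may differ."""

    @abstractmethod
    def get_computed_blocks(self, request) -> tuple[KVCacheBlocks, int]: ...

    @abstractmethod
    def allocate_blocks(self, request, num_tokens: int) -> KVCacheBlocks: ...

    @abstractmethod
    def get_block_ids(self, request_id: str) -> list[int]: ...

    @abstractmethod
    def free(self, request) -> None: ...


class DummyKVCacheManager(KVCacheManagerInterface):
    """Every call succeeds without touching a block pool, so the scheduler
    never stalls on KV cache capacity for attention-free models."""

    def get_computed_blocks(self, request) -> tuple[KVCacheBlocks, int]:
        return (KVCacheBlocks([]), 0)   # no prefix-cache hits

    def allocate_blocks(self, request, num_tokens: int) -> KVCacheBlocks:
        return KVCacheBlocks([])        # "allocation" always succeeds

    def get_block_ids(self, request_id: str) -> list[int]:
        return []                       # requests carry no block IDs

    def free(self, request) -> None:
        pass                            # nothing to release
```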
@heheda12345 any further comments on this? Thanks.
What minor changes do you need? I feel it would be quite clean to treat it as 0 groups, but maybe I've missed something.
What I have tried to do is to initialize the KV cache manager with the config below:

KVCacheConfig(num_blocks=0,
              kv_cache_tensors=[],
              kv_cache_groups=[])

This, however, is not enough for the KV cache manager to function with 0 blocks. I also ended up having to modify:
- BlockPool in v1.core.block_pool.py, because it uses a doubly linked list (FreeKVCacheBlockQueue) to track the free blocks, and that list is not designed to work with an empty list of KV cache blocks.
- Scheduler, because on each new request it invokes allocate_blocks on the KV cache manager, and if no blocks are returned the request is unschedulable. Some changes are then required to make sure that requests with no associated block IDs are schedulable.

I also had to make changes in a few other places so that the KV cache logic would work with no KV cache groups.
For simplicity, I have implemented the other approach in a separate PR: #20577
I still think that initializing all the objects that come with the KVCacheManager (KVCacheCoordinator, BlockPool, FreeKVCachePool, etc.) for no reason is less clean than overriding the KVCacheManager.
I might also have misunderstood your suggestion, @heheda12345, on using 0 KV cache groups, so please take a look at the new PR.
Thanks
Thanks. I've added some comments in #20577 and I feel that using 0 KV cache groups is possible.
This PR introduces some changes that could be useful for future work, such as the implementation of pure SSM models and encoder embedding models. Currently, the design of multi-modal inputs as embeddings seems pretty restrictive to me, but I don't know enough about all the design considerations that went into it. So, at my superficial level of knowledge about multi-modal models, the support for raw inputs seems pretty useful.
    # `kv_cache_config` to be consistent across all workers to make sure
    # all the memory operators can be applied to all workers.
    unify_kv_cache_configs(kv_cache_configs)

    if vllm_config.model_config.is_attention_free:
Could we perhaps generalize the name of this property to indicate that it's kv-cache free? Technically the encoder embedding models don't need a kv-cache but they do use attention.
I'd personally create a second flag, a subset of this one, to just disable KV caching. In the end it is fine that for attention-free models neither an attention backend nor a KV cache is instantiated.
At the moment, I see that the initialization of the attention backend is done as part of initializing the KV cache.
I believe this should be done in another PR, though. The changes here were made simply because the is_attention_free flag was not fully working when I tested it.
vllm/v1/worker/gpu_model_runner.py (Outdated)
@@ -547,6 +554,55 @@ def _update_states(self, scheduler_output: "SchedulerOutput") -> None:
        if batch_changed or batch_reordered:
            self.input_batch.refresh_sampling_metadata()

    def _add_multimodal_inputs_to_model_args(
If this function is only called in _maybe_add_multimodal_kwargs, perhaps it's better to merge them.
yes, makes sense. Thanks
vllm/v1/worker/gpu_model_runner.py (Outdated)
            scheduler_output,
            num_reqs)

    def _maybe_compute_attn_prefix(
What is the purpose of this function? It looks like an extension point for a subclass but it's not called anywhere.
I think this is just something I brought in when I started working on this, based on the V1 embeddings support before it got merged. I will remove it.
vllm/v1/worker/gpu_model_runner.py (Outdated)
    ) -> list[int]:
        return [0] * len(self.kv_cache_config.kv_cache_groups)

    def _maybe_prepare_additional_inputs(self,
Same as above.
Same as above 😄
            scheduler_config=self.vllm_config.scheduler_config,
            lora_config=self.vllm_config.lora_config,
        ).get_lora_tokenizer(None)
        self.tokenizer = None if vllm_config.model_config.skip_tokenizer_init \
How is this related to this PR?
Btw, we can't skip tokenizer init here, given that the tokenizer is required for all of the structured output backends; therefore, this shouldn't be changed.
I see, this was a quick and dirty change that I made and forgot to sanitize.
Since the tokenizer is required for structured output generation, the structured output manager should not be initialized at all when we skip tokenizer init. Also, all calls to the structured output manager should be guarded against it.
Let me give this a go and you can have a look at it.
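A minimal sketch of the guard being described, assuming the engine core owns the structured output manager and that the manager exposes a grammar_init(request) entry point; the class and attribute names below are stand-ins, not necessarily the PR's final code:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class _ModelConfig:
    """Minimal stand-in for vLLM's ModelConfig."""
    skip_tokenizer_init: bool = False


class EngineCoreSketch:
    """Illustrative only: skip building the structured-output machinery
    when there is no tokenizer, and guard every later use."""

    def __init__(self, model_config: _ModelConfig,
                 manager_factory: Callable[[], Any]):
        self.structured_output_manager: Optional[Any] = (
            None if model_config.skip_tokenizer_init else manager_factory())

    def maybe_grammar_init(self, request: Any) -> None:
        # Every call site must check for None the same way.
        if self.structured_output_manager is not None:
            self.structured_output_manager.grammar_init(request)
```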
qq: does this work with the transformers implementation?
Ah, I was referring to ...
For models that don't support structured outputs, I think we should perform the gating before server startup time.
For structured output overhead, I don't think skip_tokenizer_init is relevant here, given that the manager initializes the tokenizer itself.
This pull request has merge conflicts that must be resolved before it can be merged.
        # the input prompt we pass one token and the size of the dummy
        # embedding tensors must reflect that.
        return torch.empty(input_ids.shape)
I don't see the get_multimodal_embeddings() function here. How is this not failing in GPUModelRunner.profile_run()?
I believe that memory profiling is only done while initializing the KV cache, which is skipped because the model is attention-free. It should happen in Worker.determine_available_memory(), if I am not mistaken.
Yes, but if we're adding a new programming model for MM inputs, I think it would be nice to handle this. If not a full solution, perhaps a test to disable CUDA graphs and memory profiling for this special case.
But I'm thinking that maybe the raw MM inputs will be hard to generalize. Perhaps a hacky solution for this particular model would be to store the raw MM inputs in the get_multimodal_embeddings() method and then use them when forward() is called.
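A rough sketch of that workaround, under the assumption that the runner calls get_multimodal_embeddings() with the raw tensors before invoking forward(); the method names follow the existing multi-modal interface, and everything else here is illustrative:

```python
import torch
from torch import nn


class RawInputCachingModel(nn.Module):
    """Illustrative workaround only: stash raw multi-modal tensors when the
    runner asks for 'embeddings', then consume them in forward()."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        self._cached_mm_inputs: dict[str, torch.Tensor] = {}

    def get_multimodal_embeddings(self, **mm_kwargs) -> torch.Tensor:
        # Keep the raw tensors for the next forward() call and return a
        # placeholder so profiling has something embedding-shaped to count.
        self._cached_mm_inputs = {k: v for k, v in mm_kwargs.items()
                                  if isinstance(v, torch.Tensor)}
        return torch.empty(1, 0)

    def forward(self, input_ids: torch.Tensor, **kwargs) -> torch.Tensor:
        pixel_values = kwargs.get("pixel_values")
        if pixel_values is None:
            # Fall back to the tensors captured in get_multimodal_embeddings().
            pixel_values = self._cached_mm_inputs.get("pixel_values")
        return self.backbone(pixel_values)
```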
- Improved formatting; made is_pooling_model a @property in ModelConfig (Signed-off-by: Christian Pinto <christian.pinto@ibm.com>)
- Removed unused functions; merged functions not called anywhere else (Signed-off-by: Christian Pinto <christian.pinto@ibm.com>)
Force-pushed from 30ed44e to aab8956
Rebased on the latest master, which includes @aarnphm's changes to the structured output manager.
OK, I did another round of review on miscellaneous code to reduce the overall diff. I don't have an opinion on the KV cache part, so I will leave that for others to comment on.
vllm/v1/engine/output_processor.py (Outdated)
            parent_req=parent_req,
            request_index=request_index,
            queue=queue,
            log_stats=self.log_stats)
Suggested change (to reduce the diff):
-            log_stats=self.log_stats)
+            log_stats=self.log_stats,)
Done, thanks!
vllm/v1/engine/processor.py (Outdated)
        if tokenizer:
            max_input_id = max(prompt_ids, default=0)
            if max_input_id > tokenizer.max_token_id:
                raise ValueError(
                    f"Token id {max_input_id} is out of vocabulary")
nit:
Suggested change:
-        if tokenizer:
-            max_input_id = max(prompt_ids, default=0)
-            if max_input_id > tokenizer.max_token_id:
-                raise ValueError(
-                    f"Token id {max_input_id} is out of vocabulary")
+        if tokenizer and (max_input_id := max(prompt_ids, default=0)) > tokenizer.max_token_id:
+            raise ValueError(f"Token id {max_input_id} is out of vocabulary")
            use_eagle=self.use_eagle,
            log_stats=self.log_stats,
            enable_kv_cache_events=self.enable_kv_cache_events,
        )
Please revert back self.use_pp = ...
Nice catch, thank you.
vllm/v1/core/sched/scheduler.py (Outdated)
            req_to_new_block_ids[request.request_id] = \
                self.kv_cache_manager.get_block_ids(request.request_id)
Revert this style change; we will run ruff format at some point before 0.10.
vllm/v1/core/sched/scheduler.py (Outdated)
        # num_gpu_blocks can be ero for attention free models
        assert num_gpu_blocks is not None
Suggested change:
-        # num_gpu_blocks can be ero for attention free models
+        # num_gpu_blocks can be zero for attention free models
         assert num_gpu_blocks is not None
        if model_config.skip_tokenizer_init:
            tokenizer = None
        else:
            tokenizer = self.tokenizer.get_lora_tokenizer(lora_request)
How does this look for the diff?
Suggested change:
-        if model_config.skip_tokenizer_init:
-            tokenizer = None
-        else:
-            tokenizer = self.tokenizer.get_lora_tokenizer(lora_request)
+        tokenizer = None if model_config.skip_tokenizer_init \
+            else self.tokenizer.get_lora_tokenizer(lora_request)
vllm/v1/worker/gpu_model_runner.py (Outdated)
        # nothing to be reordered when the mdoel is attention free
        if self.model_config.is_attention_free:
            return False
Suggested change:
-            return False
+            return
vllm/v1/worker/gpu_model_runner.py (Outdated)
    def _maybe_add_multimodal_kwargs(
            self,
            model_kwargs: dict[str, Any],
            scheduler_output: "SchedulerOutput" = None,
Suggested change:
-            scheduler_output: "SchedulerOutput" = None,
+            scheduler_output: "Optional[SchedulerOutput]" = None,
vllm/v1/worker/gpu_model_runner.py (Outdated)
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds,
            **MultiModalKwargs.as_kwargs(
                model_kwargs, device=self.device))
Suggested change (to also reduce the diff):
-                model_kwargs, device=self.device))
+                model_kwargs, device=self.device),)
    def get_computed_blocks(self,
                            request: Request) -> tuple[KVCacheBlocks, int]:
        return (KVCacheBlocks([]), 0)
What's your expected behavior for prefix caching? It seems that some people want prefix caching to be enabled with the "last" pooling method.
Line 4594 in 8aeaa91:
if pooling_type is None or pooling_type.lower() != "last":
Is the KVCacheManager modification necessary for this PR? As #16188 is runnable, is it possible to include only the other necessary changes in this PR and split the KVCacheManager changes into another PR? That would both unblock this PR and make the KVCacheManager changes easier for me to review.
> Thanks. I've added some comments in #20577 and feel that 0 kv cache group is possible.
Hi @heheda12345, unfortunately these changes are necessary; otherwise I will have to hack around a number of things, including the memory profiling run, since this is not a language model and profiling as it is does not work. Those changes would be totally useless because I do not need any of this profiling to take place, as this model does not use attention. I would therefore end up making changes solely to support my model until attention-free models are supported, and then revert everything back as it was. To help you review the two options: in the new PR (the one using zero blocks), the changes for supporting attention-free models are only in the last commit. Similarly, in this PR all changes related to supporting attention-free models are in commits 953f66a and 953f66a.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
This PR enables the Prithvi MAE Geospatial model on the vLLM V1 engine. This is a result of pooling/embedding models being enabled in V1.
The main changes in this PR are:
- Support for multi-modal models that consume raw multi-modal input data
- Support for models that set skip_tokenizer_init=True
- A DummyKVCacheManager class for attention-free models
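A rough sketch of how the added example drives the model through the offline API, assuming the pooling/encode path. The model id, the multi-modal key names, and the tensor shapes below are placeholders; the examples file in this PR has the exact values:

```python
import torch
from vllm import LLM

# Placeholder model id and input keys; see the example script in this PR
# for the real checkpoint name, preprocessing, and tensor shapes.
llm = LLM(
    model="<prithvi-mae-checkpoint>",
    skip_tokenizer_init=True,   # the model has no tokenizer
    dtype="float16",
)

prompt = {
    "prompt_token_ids": [1],    # single dummy token; real inputs are raw tensors
    "multi_modal_data": {
        "pixel_values": torch.randn(1, 6, 512, 512),
        "location_coords": torch.zeros(1, 2),
    },
}

outputs = llm.encode(prompt)    # pooling path returns the raw model output
print(outputs[0])
```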
Test Plan
I have added a test script that only tests with the vLLM runner, as this model is not on HF. Therefore it only checks whether inference runs correctly for the model.
Test Result
The test is not succeeding at the moment because the pytest process hangs indefinitely while initializing the vLLM engine; it hangs on wait_for_engine_startup. The same code tested outside of pytest works just fine. Any help getting this test running would be much appreciated.

Documentation Update
Updated examples file.
@DarkLight1337 @maxdebayser @njhill