V1 embeddings #277
base: main
Conversation
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
👋 Hi! Thank you for contributing to vLLM support on Spyre.
The changes introduced by PR vllm-project/vllm#16728 to the sampler architecture were incompatible with our Spyre model runner. Initially, as a stopgap solution, I copied the old sampling classes into our vllm_spyre tree just so that we could keep working on the latest changes from main. This commit now reverts that and makes the same logits processor logic work for the Spyre input batch and model runner classes. The difference from the GPU model runner is that on Spyre we don't condense the batch; instead we keep a boolean mask that is used to calculate "dense" request indices. These indices must be used for the BatchUpdateBuilder because they are the right ones to slice the `logits` tensor that is passed to the Sampler.
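The dense-index idea above can be illustrated with a small plain-Python sketch (hypothetical helper name; the real input-batch code operates on tensors and is structured differently): each occupied slot's dense index is just the number of occupied slots that precede it.

```python
from itertools import accumulate

def dense_request_indices(occupied_mask):
    """For each batch slot, compute its row in the condensed logits
    tensor: the number of occupied slots that come before it.
    (Illustrative sketch only, not the vllm_spyre implementation.)"""
    # Inclusive prefix sum over the boolean mask, shifted down by one,
    # gives the "dense" position of every occupied slot.
    prefix = list(accumulate(int(m) for m in occupied_mask))
    return [p - 1 for p in prefix]

# Example: slots 0, 2 and 3 hold active requests; slot 1 is free.
mask = [True, False, True, True]
dense = [dense_request_indices(mask)[i] for i, m in enumerate(mask) if m]
print(dense)  # -> [0, 1, 2]
```

Slicing the logits tensor with these dense indices then lines each request up with its row in the condensed tensor handed to the Sampler.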
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
I am through 7/10 files, need to continue later... This is a huge PR:)
```python
        return model_output

class StaticBatchingSpyreModelRunner(SpyreModelRunner):

class WarmupShapesMixin:
```
hmm, might there be another, more elegant way to distribute the method `_get_padded_batch_size` to both `SpyrePoolingModelRunner` and `StaticBatchingSpyreModelRunner`? Also, the name `WarmupShapesMixin` sounds a bit weird to me 😄. Tagging our most experienced SWEs here:) @joerunde @tjohnson31415
Well, mixin classes are one of the standard ways to solve this problem. But maybe there are better options.
Oh, it is just that I have never seen them before:) my apologies, learning new things 😄 Naming is fine too, now that I better understand. I initially read it like "Warmup shapes mixing" rather than "Warmup shapes mix-in"...
Python doesn't have a standard way to do object composition so there are quite a few different ways to do it, each with their own drawbacks. This is a classic overview worth a read if you have a couple minutes: https://python-patterns.guide/gang-of-four/composition-over-inheritance/
My two cents is that mixins are better than filling up a `utils` package with a bunch of shared methods, and once you have a bunch of mixins you can usually find a pattern to apply instead that doesn't use multiple inheritance.
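For reference, the mixin pattern under discussion looks roughly like this. The class and method names come from the PR; the body of `_get_padded_batch_size` and the `warmup_shapes` format are illustrative assumptions, not the actual vllm_spyre code.

```python
class WarmupShapesMixin:
    """Shared warmup-shape helpers, mixed into every runner that needs
    them instead of duplicating the method in each class.
    (Sketch only; the real signature and logic may differ.)"""

    # Hypothetical format: (prompt_len, new_tokens, batch_size) tuples.
    warmup_shapes = ((64, 20, 4),)

    def _get_padded_batch_size(self, n_requests: int) -> int:
        # Pick the smallest warmup batch size that fits the request count.
        sizes = sorted(shape[2] for shape in self.warmup_shapes)
        for size in sizes:
            if n_requests <= size:
                return size
        return sizes[-1]

class SpyreModelRunner:
    """Stand-in for the common base class."""

class StaticBatchingSpyreModelRunner(WarmupShapesMixin, SpyreModelRunner):
    pass

class SpyrePoolingModelRunner(WarmupShapesMixin, SpyreModelRunner):
    pass

print(StaticBatchingSpyreModelRunner()._get_padded_batch_size(3))  # -> 4
```

Both runners pick up `_get_padded_batch_size` from the mixin, so the logic lives in exactly one place without forcing it into the shared base class.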
thanks a lot. will read this in a quiet minute:)
All tests are passing now after the changes from the first round of reviews.
Description
This PR enables embedding models on vLLM V1. In contrast with the V1 GPU implementation, I added a separate model runner here because most embedding models have no need for continuous batching. To avoid code repetition, I refactored the input batch and model runner classes into a class hierarchy with common base classes.
@gmarinho2 contributed a test that verifies that the returned embeddings don't change with batch size.
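The batch-size invariance check can be sketched as follows. The helper names and the toy embedder are hypothetical; the real test calls into the vLLM embedding entrypoint and compares tensors.

```python
import math

def embeddings_invariant_to_batch_size(embed_fn, prompts, batch_sizes=(1, 4)):
    """Verify each prompt's embedding is numerically the same regardless
    of the batch size used to compute it. `embed_fn(prompts, batch_size)`
    returns one embedding vector per prompt (stand-in for the real API)."""
    reference = embed_fn(prompts, batch_sizes[0])
    for bs in batch_sizes[1:]:
        candidate = embed_fn(prompts, bs)
        for ref, cand in zip(reference, candidate):
            # Allow tiny numerical differences from padding/kernels.
            assert all(math.isclose(r, c, rel_tol=1e-5, abs_tol=1e-6)
                       for r, c in zip(ref, cand))
    return True

# Toy embedder whose output is independent of batching, as required.
def toy_embed(prompts, batch_size):
    return [[float(len(p)), float(sum(map(ord, p)))] for p in prompts]

print(embeddings_invariant_to_batch_size(toy_embed, ["a", "bb"]))  # -> True
```

Such a check guards against padding or batching logic leaking into the embedding values, which is exactly the regression the contributed test is meant to catch.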