[CB][do not merge] Support batch size 1 for decode, simplify warmup #312
base: main
Conversation
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
👋 Hi! Thank you for contributing to vLLM support on Spyre. [...] Now you are good to go 🚀
bot:test
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
bot test failed in warmup decode.
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
bot:test
Note: CPU failures are expected for BS 1 (didn't adapt the warmup as in #287). Spyre card: reverting the warmup changes results in a runtime error.
Looks like batch size 1 for decode is not supported by the compiler yet... The priority of this is low, as the performance advantage is marginal and the use case is limited.
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
Update: I tried Josh's suggestion, so far without success.
[CB][do not merge] Support batch size 1 for decode, simplify warmup
Now that we have switched to torch 2.7.1 in #307, dynamic dimensions of size 1 are supported by torch. Hence, batch size 1 for decode should produce the same graph as batch sizes >= 2. This PR relaxes the constraint and adapts the warmup accordingly. To be tested on the card. If it works, this makes #287 redundant.
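
A minimal sketch of the torch behavior the description relies on (not the vllm-spyre code itself; the module, shapes, and names are made up for illustration): a dimension exported as dynamic with `min=1` can take the value 1 at runtime, so a decode graph traced once should serve batch size 1 and batch sizes >= 2 alike.

```python
import torch
from torch.export import Dim, export


class ToyDecode(torch.nn.Module):
    """Stand-in for a decode step; any shape-preserving op works here."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2.0


model = ToyDecode()
example = torch.randn(4, 8)  # trace with batch size 4

# Mark the batch dimension dynamic, explicitly allowing it to be 1.
batch = Dim("batch", min=1)
program = export(model, (example,), dynamic_shapes={"x": {0: batch}})

# With torch >= 2.7.1 this should run on both batch sizes without retracing.
print(program.module()(torch.randn(1, 8)).shape)  # torch.Size([1, 8])
print(program.module()(torch.randn(4, 8)).shape)  # torch.Size([4, 8])
```

On older torch versions, size-1 dimensions tended to get specialized into a separate graph, which is presumably why batch size 1 previously needed its own handling in the warmup (#287).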