Unexpected latency of StarCoder when enable tensor parallel #666
zhaoyang-star
announced in
Q&A
Replies: 1 comment
-
I found that the running requests is ~6 when After I set |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
The latency of StarCoder running on 2 A100 40GB is higher than that running on 1 A100.
While, the latency of LLaMA-13B running on 2 A100 40GB is lower than that running on 1 A100 as expected.
So does MultiQueryAttenion impl in vLLM cause this?
Note: Prompt token length and output token length both are set to 1k.
Beta Was this translation helpful? Give feedback.
All reactions