Managing Performance (token/s) #9665
gotDaijobu announced in Q&A · 2 comments · 1 reply

Hi,
I'm using vLLM for inference. I've given it several tries and I'm currently running a Llama 3.1 70B model quantized with bitsandbytes (4-bit) on a single A100 (80 GB).
I've run quite a lot of tests and realized that decreasing (or increasing) gpu_memory_utilization has no effect on throughput: I'm stuck at 13 tokens/s.
So the bottleneck is elsewhere, but I can't figure out where.
Any ideas?
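
For concreteness, a minimal sketch of the setup described above using vLLM's offline API; the model name and values are assumptions, not taken from this thread. One relevant detail: gpu_memory_utilization sets how much GPU memory vLLM may reserve, and the headroom beyond the weights mostly becomes KV-cache blocks, so it governs how many sequences can be in flight at once rather than the decode speed of a single request.

```python
from vllm import LLM, SamplingParams

# Hypothetical reproduction of the setup above: Llama 3.1 70B with
# bitsandbytes 4-bit in-flight quantization on one 80 GB A100.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    # Fraction of GPU memory vLLM may reserve. Headroom beyond the weights
    # mostly becomes KV-cache blocks (more concurrent sequences); it does
    # not speed up decoding of a single request.
    gpu_memory_utilization=0.90,
    max_model_len=4096,
)

sampling = SamplingParams(temperature=0.0, max_tokens=128)
out = llm.generate(["Explain paged attention in one paragraph."], sampling)
print(out[0].outputs[0].text)
```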
- It's probably memory bandwidth bound. Have you tried increasing the request concurrency?
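
A sketch of one way to test that suggestion with the offline API: sweep the number of prompts submitted at once and compare throughput (model name and settings are assumptions, carried over from the sketch above).

```python
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    gpu_memory_utilization=0.90,
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Submit progressively larger batches and report aggregate vs per-request
# tokens/s. vLLM batches the prompts internally via continuous batching.
for batch_size in (1, 4, 16, 64):
    prompts = ["Write a short story about a GPU."] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch={batch_size:3d}  aggregate={tokens / elapsed:8.1f} tok/s  "
          f"per-request={tokens / elapsed / batch_size:6.1f} tok/s")
```

If decode really is memory-bandwidth bound, aggregate tokens/s should climb steeply with batch size, because each decode step reads the weights once for the whole batch, while per-request tokens/s degrades only gradually.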
1 reply
- So I've done further tests: when I have several requests in parallel, performance drops. I'm struggling to find the cause.
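
One thing worth separating here: with parallel requests, per-request tokens/s is expected to drop; the question is whether aggregate tokens/s across all requests goes up. Below is a sketch for measuring that against a running vLLM OpenAI-compatible server (the URL, model name, and concurrency levels are assumptions).

```python
import asyncio
import time

from openai import AsyncOpenAI  # pip install openai

# Assumes a vLLM server is already running, e.g.:
#   vllm serve meta-llama/Llama-3.1-70B-Instruct \
#       --quantization bitsandbytes --load-format bitsandbytes
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> int:
    resp = await client.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",  # assumed model name
        prompt="Write a short story about a GPU.",
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def measure(n_parallel: int) -> None:
    # Fire n_parallel requests at once and time them together.
    start = time.perf_counter()
    counts = await asyncio.gather(*(one_request() for _ in range(n_parallel)))
    elapsed = time.perf_counter() - start
    total = sum(counts)
    print(f"{n_parallel:3d} parallel  aggregate={total / elapsed:8.1f} tok/s  "
          f"per-request={total / elapsed / n_parallel:6.1f} tok/s")

for n in (1, 4, 16):
    asyncio.run(measure(n))
```

If aggregate throughput also falls as concurrency rises, that points away from simple bandwidth saturation and toward something like KV-cache pressure (vLLM logs a warning when sequences are preempted for lack of cache space) or overhead in the bitsandbytes dequantization path.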