Concurrency Limit Issue at 32 with VLLM #3605
timeconnection
announced in Q&A
I am currently using CUDA 11.8 for accelerated inference with vLLM 0.3.3+cu118, deployed on an A800 GPU. The main parameters are `--gpu 1 --dtype float16 --tokenizer-mode slow --gpu-memory-utilization 0.6 --max-model 8096`. However, concurrency appears to be capped at 32, and adjusting `--gpu-memory-utilization` has not helped. How can I achieve higher concurrency?
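For reference, here is a minimal sketch of roughly this setup through vLLM's Python API. The model path, the prompt batch, and `max_num_seqs=256` are placeholders I added, not values from the command above; `max_num_seqs` (CLI: `--max-num-seqs`) is the scheduler-level cap on concurrently running sequences, and the KV-cache block budget derived from `gpu-memory-utilization` and the model length can also bound effective concurrency.

```python
from vllm import LLM, SamplingParams

# Roughly the setup described above, via vLLM's Python API (vLLM 0.3.x).
# max_num_seqs bounds how many sequences the scheduler runs at once;
# the KV-cache block budget (gpu_memory_utilization, max_model_len)
# can further limit how many of them actually decode in parallel.
llm = LLM(
    model="/path/to/model",        # placeholder; substitute the actual checkpoint
    dtype="float16",
    tokenizer_mode="slow",
    gpu_memory_utilization=0.6,    # raising this leaves more room for KV-cache blocks
    max_model_len=8096,
    max_num_seqs=256,              # placeholder value; upper bound on concurrent sequences
)

# Submit a batch of prompts; vLLM schedules them with continuous batching.
outputs = llm.generate(
    ["Hello, my name is"] * 64,
    SamplingParams(max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```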
