Concurrency Limit Issue at 32 with VLLM #3605
timeconnection
announced in Q&A
I am currently using CUDA 11.8 for accelerated inference with vLLM 0.3.3+cu118, deployed on an A800 GPU. The main parameters are `--gpu 1 --dtype float16 --tokenizer-mode slow --gpu-memory-utilization 0.6 --max-model 8096`. However, concurrency appears to be capped at 32, and adjusting `--gpu-memory-utilization` has not helped. How can I achieve higher concurrency?
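For reference, here is a minimal sketch of roughly this setup through vLLM's Python API. The model path, the prompt batch, and `max_num_seqs=256` are placeholders I added, not values from the command above; `max_num_seqs` (CLI: `--max-num-seqs`) is the scheduler-level cap on concurrently running sequences, and the KV-cache block budget derived from `gpu-memory-utilization` and the model length can also bound effective concurrency.

```python
from vllm import LLM, SamplingParams

# Roughly the setup described above, via vLLM's Python API (vLLM 0.3.x).
# max_num_seqs bounds how many sequences the scheduler runs at once;
# the KV-cache block budget (gpu_memory_utilization, max_model_len)
# can further limit how many of them actually decode in parallel.
llm = LLM(
    model="/path/to/model",        # placeholder; substitute the actual checkpoint
    dtype="float16",
    tokenizer_mode="slow",
    gpu_memory_utilization=0.6,    # raising this leaves more room for KV-cache blocks
    max_model_len=8096,
    max_num_seqs=256,              # placeholder value; upper bound on concurrent sequences
)

# Submit a batch of prompts; vLLM schedules them with continuous batching.
outputs = llm.generate(
    ["Hello, my name is"] * 64,
    SamplingParams(max_tokens=64),
)
for out in outputs:
    print(out.outputs[0].text)
```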
