How to dynamically allocate the maximum request count based on the GPU's compute capability? #6611

allrobot · 2024-04-11T15:40:09Z

allrobot
Apr 11, 2024

.\llama\server.exe -m .\xxxxx.gguf -c 2048 -ngl %ngl% -a xxxxxx --host 127.0.0.1 --parallel 50 --threads-http 50

The GeForce RTX 3080 has 10GB of VRAM. When concurrently running several dozen requests in Python using async, with each request containing tens of thousands of tokens, if the server receives multiple requests at the same time and the GPU compute is fully utilized, will it continue to slowly process a dozen requests?

I'm not particularly familiar with AI technologies. How can one determine, based on the GPU performance, the maximum concurrent request count for async processing when the model is handling several thousand tokens to ensure the model answers questions most efficiently?

Is it necessary for users to manually adjust the maximum number of async requests?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How to dynamically allocate the maximum request count based on the GPU's compute capability? #6611

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

How to dynamically allocate the maximum request count based on the GPU's compute capability? #6611

Uh oh!

allrobot Apr 11, 2024

Replies: 0 comments

allrobot
Apr 11, 2024