Replies: 1 comment
-
With a CPU, slower inference times are expected, but you can enable parallel requests by setting the environment variable appropriately (see line 72 in 39a6b56). However, that would probably not work very well with CPUs.
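A minimal docker-compose sketch of what that could look like is below. The variable names (`PARALLEL_REQUESTS`, `LLAMACPP_PARALLEL`, `THREADS`) and the image tag are assumptions based on the sample `.env` shipped with 2.x releases, not anything confirmed in this thread, so check the line referenced above for your version:

```yaml
# Sketch only: enable parallel request handling for a CPU-only LocalAI deployment.
# Variable names and image tag are assumptions; verify against the .env of your release.
services:
  localai:
    image: quay.io/go-skynet/local-ai:v2.7.0
    ports:
      - "8080:8080"
    environment:
      - THREADS=14               # threads per inference, not total logical cores
      - PARALLEL_REQUESTS=true   # let the API layer dispatch requests concurrently
      - LLAMACPP_PARALLEL=4      # number of llama.cpp slots serving requests in parallel
    volumes:
      - ./models:/models
```

Keep in mind that each additional slot competes for the same physical cores, which is why parallelism tends to help little on CPU-only hosts.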
-
I have deployed the latest LocalAI container (v2.7.0) on 2 x Xeon 2680V4 (56 threads in total) with 198 GB of RAM, but from what I can tell, requests hitting the /v1/chat/completions endpoint are being processed one after the other, not in parallel.
I believe there are enough resources on this system to process these requests in parallel.
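For what it's worth, this is roughly how I'm checking it (a minimal sketch assuming the default port 8080 on localhost; my actual curl command and compose file are attached below). When I fire two identical requests at the same time, they seem to complete back to back rather than together:

```bash
# Fire two identical chat-completion requests concurrently and time the batch.
# If the server handled them in parallel, the total time should stay close to
# a single request; roughly double suggests they are processed one by one.
time (
  for i in 1 2; do
    curl -s http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "openhermes-2.5-mistral-7b.Q8_0.gguf",
            "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
          }' > /dev/null &
  done
  wait
)
```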
I am using the openhermes-2.5-mistral-7b.Q8_0.gguf model.
Getting a response to the curl example below takes 20+ seconds. Is that good or bad for the system I have?
Thanks
PS: I can't install a GPU on this system as it is a 1U unit.
Curl:
Docker-compose:
LocalAI logs: