Parallel Requests support #8567
Unanswered
akhilreddy0703 asked this question in Q&A
Replies: 2 comments 9 replies
-
Hard to say. Can you run the same test without using the Docker stuff?

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# provide log from this command
LLAMA_DISABLE_LOGS=1 make -j
./llama-server \
  --port 8082 \
  -m $model_path \
  -c 40960 \
  --no-mmap \
  --threads 48 \
  --parallel 100

Also run the following benchmark and post the results:

./llama-bench \
  -m $model_path \
  -p 1,2,4,8,10,16,32,64,100,128
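
To drive parallel load against the server started above, something along these lines can be used; the prompt, n_predict, and the number of concurrent requests are illustrative assumptions, and the port matches the command above.

# Fire 10 concurrent requests at the server's /completion endpoint (assumed values).
# Each curl runs in the background; `wait` blocks until all of them finish and
# `time` reports the total wall clock for the batch.
time (
  for i in $(seq 1 10); do
    curl -s http://localhost:8082/completion \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Write a short story about a robot.", "n_predict": 128}' \
      -o /dev/null &
  done
  wait
)

Dividing the total generated tokens (here 10 x 128) by that wall-clock time gives a rough aggregate tokens/sec figure comparable to the numbers discussed in this thread.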
-
Hi @akhilreddy0703 @ggerganov,
-
Hi @ggerganov and community,
I ran an experiment comparing how TGI and llama.cpp handle parallel requests, to understand how many parallel users a single instance can serve. I did a Docker deployment of the llama.cpp server with the configuration below for Llama3-8B (int4).
Test details
Additional Info:
With a TGI deployment of the Llama3-8B (bf16) model, I also tested (1, 3, 10, 30, 100) parallel requests and got approximately (10, 9, 7, 5, 3) tokens/sec for the respective request counts.
It is obvious that a lower-precision (int4) model should ideally get better throughput than a half-precision (bf16) model.
But what I observed is that TGI serves better: its throughput decreases gradually as the number of parallel requests increases, while llama.cpp does not show the same behaviour.
The question is:
Why do I see a drastic fall in throughput on the llama.cpp server hosting a quantized (int4) model compared to the TGI server hosting a bf16 model? Is it a problem with how llama.cpp handles parallelization?
Has anyone explored parallelization with llama.cpp? I would like to hear your thoughts, and please suggest good practices to get the best out of it.
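
For concreteness, a minimal sketch of the kind of bare-metal launch in question; the flag values are illustrative assumptions, not the Docker configuration used in the test above. As far as I understand, the server divides the -c context window across the --parallel slots, so each concurrent request gets roughly c/parallel tokens of context.

# Illustrative launch, not the actual test configuration:
# 8 slots share a 32768-token context (~4096 tokens per concurrent request),
# with continuous batching so slots are filled as requests arrive.
./llama-server \
  -m $model_path \
  --port 8082 \
  -c 32768 \
  --parallel 8 \
  --cont-batching \
  --threads 48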
References:
Thanks in advance,
Akhilreddy G.