CPU-based real-time code generation chatbot workload #10910
jackpots28 asked this question in Q&A (unanswered)
Hello,
I hope all is well!
Our team is looking for advice on CPU-based inference workloads.
#6510
#######################################################
We are trying to determine the best way to simulate a CPU-based real-time code generation chatbot workload with llama.cpp.
For this deployment, we plan for the endpoint to serve a maximum of 16 requests in parallel with a total KV cache of 131072 tokens, which means each request should not exceed 131072 / 16 = 8192 tokens (prompt + completion).
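For context, a rough sketch of the kind of serving command we have in mind is below; the model path, thread count, and flags are illustrative assumptions on our part (we are assuming -c, -np, and -cb behave as documented for llama-server):
llama-server -m /models/mistral-7b-instruct-v0.3-q8_0.gguf -c 131072 -np 16 -cb -ngl 0 -t 20
With -np 16 and -c 131072, each slot would get 131072 / 16 = 8192 tokens, matching the budget above.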
Which tool would you suggest for this, and with what parameters:
llama-batched-bench
or
llama-bench
We have reviewed your explanation in this post:
https://www.reddit.com/r/LocalLLaMA/comments/1f4bact/llamacpp_parallel_arguments_need_explanation/
Based on that, we are considering the following command for the simulation (keeping np >= b >= ub):
llama-batched-bench -m /models/mistral-7b-instruct-v0.3-q8_0.gguf --cont-batching -ngl 0 --prio 0 -t 20 -npp 512 -ntg 128 -ub 1,2,4 -b 1,2,4,8,10,12 -npl 1,2,4,8,10,12
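If llama-batched-bench is the right tool and -b / -ub only accept single values there (in which case the sweep above would need separate runs), a single-run sketch closer to our 16 x 8192-token target might look like this; the numbers, model path, and flag behavior are our assumptions, not a verified recipe:
llama-batched-bench -m /models/mistral-7b-instruct-v0.3-q8_0.gguf -c 131072 -ngl 0 --prio 0 -t 20 -b 2048 -ub 512 -npp 512,2048,7680 -ntg 128,512 -npl 1,2,4,8,16
Would something like this correctly model 16 parallel requests at up to 8192 tokens each (e.g. 7680 prompt + 512 generated)?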
#######################################################
Thank you for any insight into this matter!