CPU-based real-time code generation chatbot workload #10910
jackpots28 asked this question in Q&A (unanswered)
Hello,
I hope all is well!
Our team is looking for advice on CPU-based inference workloads.
#6510
#######################################################
We are trying to determine the best way to simulate a CPU-based real-time code generation chatbot workload with llama.cpp.
For this deployment, we plan for the endpoint to serve a maximum of 16 requests in parallel with a total KV cache of 131072 tokens, which means each request should not exceed 131072 / 16 = 8192 tokens (prompt + completion).
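For context, a rough sketch of the kind of serving command we have in mind is below; the model path, thread count, and flags are illustrative assumptions on our part (we are assuming -c, -np, and -cb behave as documented for llama-server):
llama-server -m /models/mistral-7b-instruct-v0.3-q8_0.gguf -c 131072 -np 16 -cb -ngl 0 -t 20
With -np 16 and -c 131072, each slot would get 131072 / 16 = 8192 tokens, matching the budget above.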
Which tool would you suggest for this, and with what parameters:
llama-batched-bench
or
llama-bench
We have reviewed your explanation in this post:
https://www.reddit.com/r/LocalLLaMA/comments/1f4bact/llamacpp_parallel_arguments_need_explanation/
Based on that, we are considering the following command for the simulation (keeping np >= b >= ub):
llama-batched-bench -m /models/mistral-7b-instruct-v0.3-q8_0.gguf --cont-batching -ngl 0 --prio 0 -t 20 -npp 512 -ntg 128 -ub 1,2,4 -b 1,2,4,8,10,12 -npl 1,2,4,8,10,12
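If llama-batched-bench is the right tool and -b / -ub only accept single values there (in which case the sweep above would need separate runs), a single-run sketch closer to our 16 x 8192-token target might look like this; the numbers, model path, and flag behavior are our assumptions, not a verified recipe:
llama-batched-bench -m /models/mistral-7b-instruct-v0.3-q8_0.gguf -c 131072 -ngl 0 --prio 0 -t 20 -b 2048 -ub 512 -npp 512,2048,7680 -ntg 128,512 -npl 1,2,4,8,16
Would something like this correctly model 16 parallel requests at up to 8192 tokens each (e.g. 7680 prompt + 512 generated)?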
#######################################################
Thank you for any insight into this matter!