Dear community!
I have only recently started getting acquainted with llama.cpp, and it is fantastic!
But there is one thing I cannot figure out.
I am running llama.cpp built for CPU only on my local laptop, so inference on a given prompt takes a very long time. That behavior is quite understandable to me, since it is a CPU-only build and my laptop is not particularly powerful. But when I run interactive mode with the prompt templates from the repo to simulate a GPT-like chat, there is a similar "cold start" that takes just as long, yet the subsequent replies in the chat are generated on the fly with almost no delay.
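For context, this is roughly the kind of invocation I use for interactive mode (the model path is a placeholder; the prompt file is one of the templates shipped in the repo's prompts/ folder):

```sh
# Interactive chat using a prompt template from the repository
# (the binary name and model path may differ on your setup)
./main -m ./models/7B/ggml-model-q4_0.gguf \
  --color -i -r "User:" \
  -f prompts/chat-with-bob.txt
```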
Hence I have two questions:
1. How is this implemented? What is happening under the hood at that moment?
2. How can I achieve the same behavior without using interactive mode?
I tried to figure this out by reading the documentation, and I thought it could be achieved with "--prompt-cache" and "--prompt-cache-all". But as far as I understand, "--prompt-cache-all" will not be useful in a chat-simulation setup (the functionality I want to achieve), because after a new user message is appended to the cached prompt, the whole prompt still gets re-evaluated, so I won't get responses as fast as in interactive mode.
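To make it concrete, here is a sketch of the workflow I had in mind (the model path, cache file name, and prompt contents are placeholders; the flags are the same ones I mentioned above):

```sh
# First run: evaluate the template prompt once and save the state to a cache file
./main -m ./models/7B/ggml-model-q4_0.gguf \
  --prompt-cache chat.cache --prompt-cache-all \
  -f prompts/chat-with-bob.txt -n 64

# Follow-up run: same cache file, prompt extended with the next user message;
# my concern is that this extended prompt still gets re-evaluated here
./main -m ./models/7B/ggml-model-q4_0.gguf \
  --prompt-cache chat.cache --prompt-cache-all \
  -p "...previous conversation... User: next question\nBob:" -n 64
```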
I would be glad to get any help!