Dear community!
I have only recently started getting acquainted with llama.cpp, and it is fantastic!
But there is one thing I cannot figure out.
I am running llama.cpp built for CPU only on my local laptop, so inference on a given prompt takes a very long time. That behavior is quite understandable to me, since it is a CPU-only build and my laptop is not particularly powerful. But when I run interactive mode with the prompt templates from the repo to simulate a GPT-like chat, there is a similar "cold start" that takes just as long, yet the subsequent replies in the chat are generated on the fly with almost no delay.
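For context, this is roughly the kind of invocation I use for interactive mode (the model path is a placeholder; the prompt file is one of the templates shipped in the repo's prompts/ folder):

```sh
# Interactive chat using a prompt template from the repository
# (the binary name and model path may differ on your setup)
./main -m ./models/7B/ggml-model-q4_0.gguf \
  --color -i -r "User:" \
  -f prompts/chat-with-bob.txt
```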
Hence I have two questions:
1. How is this implemented? What is happening under the hood at that moment?
2. How can I achieve the same behavior without using interactive mode?
I tried to figure this out by reading the documentation, and I thought it could be achieved with "--prompt-cache" and "--prompt-cache-all". But as far as I understand, "--prompt-cache-all" will not be useful in a chat-simulation setup (the functionality I want to achieve), because after a new user message is appended to the cached prompt, the whole prompt still gets re-evaluated, so I won't get responses as fast as in interactive mode.
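To make it concrete, here is a sketch of the workflow I had in mind (the model path, cache file name, and prompt contents are placeholders; the flags are the same ones I mentioned above):

```sh
# First run: evaluate the template prompt once and save the state to a cache file
./main -m ./models/7B/ggml-model-q4_0.gguf \
  --prompt-cache chat.cache --prompt-cache-all \
  -f prompts/chat-with-bob.txt -n 64

# Follow-up run: same cache file, prompt extended with the next user message;
# my concern is that this extended prompt still gets re-evaluated here
./main -m ./models/7B/ggml-model-q4_0.gguf \
  --prompt-cache chat.cache --prompt-cache-all \
  -p "...previous conversation... User: next question\nBob:" -n 64
```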
I would be glad to get any help!