I'm working on a chat interface powered by server.cpp (an incredible little tool 👏). I noticed that if I only append to the prompt, like the chat example app does, it responds very quickly (within 2 seconds on my MacBook), as if it is continuing where it left off. But if I change the beginning of the prompt, it takes a long time to respond (roughly linear in the total prompt length). My intuition is that it caches the state and blows away the cache if the previous prompt is not a prefix of the new prompt. Can anyone explain what's happening there at a high level? I was trying it with Alpaca-style prompt formats, where it's common to put conversation history in the
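For intuition, here's a minimal sketch (not server.cpp's actual code) of the prefix-cache behavior being described: only the tokens past the longest common prefix with the previously evaluated prompt need to be re-evaluated, so an append-only prompt is cheap while a changed beginning forces re-evaluating everything.

```python
# Illustrative sketch of prefix caching: the number of tokens that must be
# re-evaluated is the new prompt length minus the shared leading prefix.

def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached and new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = ["<sys>", "Hello", "Hi", "there"]            # previous prompt + generated text
appended = cached + ["How", "are", "you?"]            # append-only: reuse everything cached
changed = ["<sys2>", "Hello", "Hi", "there", "How"]   # changed start: cache is useless

print(len(appended) - common_prefix_len(cached, appended))  # 3 tokens to evaluate
print(len(changed) - common_prefix_len(cached, changed))    # 5 tokens to evaluate
```

This is why the slowdown looks roughly linear in total prompt length: with no shared prefix, every token of the new prompt has to go through the model again.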
The server reuses the previously evaluated state if the start of the string is the same. Try to format the prompt so that the changing parts come at the end. Note that the text generated by the model is also part of this cache, which is why append-only is the fastest. Or use a GPU, which can evaluate even long prompts almost instantly.
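As an example of "changing parts at the end", here's a hypothetical Alpaca-style prompt builder (the `SYSTEM` text and `build_prompt` helper are made up for illustration): the fixed instruction stays at the very start, and each new turn is appended after the model's previous response, so every request extends the previous prompt.

```python
# Hypothetical prompt builder: a fixed system preamble first, then the
# conversation history, then the newest user turn. Because the model's
# own response is appended right where it was generated, the next prompt
# is a pure extension of the previous one (cache-friendly).

SYSTEM = "Below is a conversation between a user and an assistant.\n\n"

def build_prompt(history, user_msg):
    turns = "".join(
        f"### Instruction:\n{u}\n### Response:\n{a}\n" for u, a in history
    )
    return SYSTEM + turns + f"### Instruction:\n{user_msg}\n### Response:\n"

p1 = build_prompt([], "Hello")
# Suppose the model answered exactly "Hi there!"; the next prompt then
# starts with the entire previous prompt plus that answer.
p2 = build_prompt([("Hello", "Hi there!")], "How are you?")
print(p2.startswith(p1))  # True: the cached state can be fully reused
```

If instead the history were summarized or truncated at the top of the prompt, the shared prefix would break and the whole prompt would be re-evaluated.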