Hey, I currently use llama-server with Open-WebUI. When the context is long, I have to wait a long time before the model starts writing an answer every time I ask something. This is not the case with llama-cli in conversation mode, where follow-up questions don't make the model reprocess the whole prompt. Edit: #4347 seems to be the fix, but can I specify cache_prompt=True when starting the server, or do I need to send it with every request?
For now I just changed the default value for cache_prompt to true and rebuilt, since I can't find which parameter to pass to llama-server for it.
You need to send `cache_prompt` with every request.
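For reference, here is a minimal sketch of what that looks like against the server's /completion endpoint, written in Python. It assumes the default address http://localhost:8080 and non-streaming output; adjust the URL and generation settings to match how you actually launch llama-server.

```python
# Minimal sketch: pass cache_prompt with every /completion request to llama-server.
# Assumes the server is reachable at the default http://localhost:8080.
import requests

LLAMA_SERVER_URL = "http://localhost:8080/completion"


def ask(prompt: str) -> str:
    """Send a completion request, asking the server to keep the prompt in the KV cache."""
    payload = {
        "prompt": prompt,
        "n_predict": 256,
        # Reuse the already-evaluated prompt prefix on the next request
        # instead of reprocessing the whole conversation from scratch.
        "cache_prompt": True,
    }
    response = requests.post(LLAMA_SERVER_URL, json=payload, timeout=600)
    response.raise_for_status()
    return response.json()["content"]


if __name__ == "__main__":
    history = "User: Summarize the plot of Hamlet.\nAssistant:"
    first = ask(history)
    # The follow-up shares the long prefix, so only the new tokens need evaluation.
    print(ask(history + first + "\nUser: Who is Ophelia?\nAssistant:"))
```

If the client sitting in front of the server (here Open-WebUI) gives you no way to add this field to its requests, then changing the default in the source and rebuilding, as described above, is the workaround.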