I wanted to discuss this here before requesting it as a new feature or trying to implement it.
The server now supports a cache parameter. When the cache is on, my understanding is that when a new prompt is processed, instead of resetting the context, the server determines the longest matching prefix and starts evaluation from the point where the last generation and the new prompt differ.
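For concreteness, here is a minimal sketch of that prefix-matching step as I understand it (not the server's actual code; the function name is mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Find how many leading tokens the new prompt shares with what is already
// in the KV cache; evaluation can then resume at that position instead of 0.
size_t common_prefix_len(const std::vector<llama_token> & cached,
                         const std::vector<llama_token> & prompt) {
    const size_t n = std::min(cached.size(), prompt.size());
    size_t i = 0;
    while (i < n && cached[i] == prompt[i]) {
        ++i;
    }
    return i;
}
```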
Named Cache:
In addition to the default last-generation cache, this new parameter would allow creating a cache entry by name. The context from the first generation is saved in a second-tier cache (e.g. memory if free, otherwise disk). All subsequent requests with named_cache load the context from the entry saved earlier and continue from there.
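A minimal sketch of what I have in mind, assuming the whole-context state API used by the save-load-state example (llama_get_state_size, llama_copy_state_data, llama_set_state_data); named_cache, save_named_cache, and load_named_cache are hypothetical names:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include "llama.h"

// Hypothetical second-tier store: RAM here, but entries could spill to disk.
static std::map<std::string, std::vector<uint8_t>> named_cache;

// Snapshot the current context state (KV cache, rng, logits) under a name.
static void save_named_cache(llama_context * ctx, const std::string & name) {
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state.data());
    named_cache[name] = std::move(state);
}

// Restore a snapshot; returns false so the caller can fall back to
// evaluating the prompt from scratch when the name is unknown.
static bool load_named_cache(llama_context * ctx, const std::string & name) {
    auto it = named_cache.find(name);
    if (it == named_cache.end()) {
        return false;
    }
    llama_set_state_data(ctx, it->second.data());
    return true;
}
```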
I might be wrong, but my assumption is that, for a long prompt, loading the state from memory or disk is cheaper than evaluating the prompt from scratch.
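To put rough numbers on that assumption (illustrative, not measured): for a LLaMA-7B-style model (32 layers, 4096-dim embeddings, fp16 K and V), the KV cache is about 2 × 32 × 4096 × 2 bytes ≈ 512 KiB per token, so a 4096-token prompt carries roughly 2 GiB of state. Reading that from an NVMe drive at ~2 GB/s takes on the order of a second, while re-evaluating 4096 tokens at, say, 50 tokens/s of CPU prompt processing takes over a minute. On a fast GPU that processes thousands of prompt tokens per second the gap narrows considerably, so the win depends on hardware.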
If the idea is sound, any pointers for implementation would be nice. Based on my study so far, saving and loading are straightforward (following the save-load-state example), but the state shared across slots needs to be taken care of. My main concern is continuous batching: I am not sure how this approach affects it.
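To make the batching worry concrete, a sketch of the failure mode and one conservative workaround, assuming the whole-context API from the sketch above (batch_mutex and restore_into_slot are hypothetical):

```cpp
#include <mutex>
#include <string>

#include "llama.h"

bool load_named_cache(llama_context * ctx, const std::string & name); // from the sketch above

// The whole-context state calls snapshot and overwrite the KV cache of
// *every* sequence, so restoring into one slot while other slots are
// mid-generation would clobber their state. A conservative first cut is to
// serialize the restore behind the same lock that guards batch construction
// and only perform it when no other slot has tokens in flight.
static std::mutex batch_mutex; // hypothetical: the server's batching lock

static bool restore_into_slot(llama_context * ctx, const std::string & name) {
    std::lock_guard<std::mutex> lock(batch_mutex);
    // Precondition (not checked here): all other slots are idle.
    return load_named_cache(ctx, name);
}
```

A per-sequence save/restore (one slot's KV cells only) would sidestep this entirely, but I don't know whether the current API exposes that.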