I wanted to discuss this here before requesting it as a new feature or trying to implement it.
The server now supports a cache parameter. When the cache is on, my understanding is that when a new prompt is processed, instead of resetting the context, the server determines the longest matching prefix and starts evaluation from the point where the last generation and the new prompt differ.
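For concreteness, here is a minimal sketch of that prefix-matching step as I understand it (not the server's actual code; the function name is mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Find how many leading tokens the new prompt shares with what is already
// in the KV cache; evaluation can then resume at that position instead of 0.
size_t common_prefix_len(const std::vector<llama_token> & cached,
                         const std::vector<llama_token> & prompt) {
    const size_t n = std::min(cached.size(), prompt.size());
    size_t i = 0;
    while (i < n && cached[i] == prompt[i]) {
        ++i;
    }
    return i;
}
```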
Named Cache:
In addition to the default last-generation cache, this new parameter would allow creating a cache entry by name. The context from the first generation is saved in a second-tier cache (e.g. memory if free, otherwise disk). All subsequent requests with named_cache load the context from the entry saved earlier and continue from there.
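A minimal sketch of what I have in mind, assuming the whole-context state API used by the save-load-state example (llama_get_state_size, llama_copy_state_data, llama_set_state_data); named_cache, save_named_cache, and load_named_cache are hypothetical names:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include "llama.h"

// Hypothetical second-tier store: RAM here, but entries could spill to disk.
static std::map<std::string, std::vector<uint8_t>> named_cache;

// Snapshot the current context state (KV cache, rng, logits) under a name.
static void save_named_cache(llama_context * ctx, const std::string & name) {
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, state.data());
    named_cache[name] = std::move(state);
}

// Restore a snapshot; returns false so the caller can fall back to
// evaluating the prompt from scratch when the name is unknown.
static bool load_named_cache(llama_context * ctx, const std::string & name) {
    auto it = named_cache.find(name);
    if (it == named_cache.end()) {
        return false;
    }
    llama_set_state_data(ctx, it->second.data());
    return true;
}
```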
I might be wrong, but my assumption is that, for a long prompt, loading the state from memory or disk is cheaper than evaluating the prompt from scratch.
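To put rough numbers on that assumption (illustrative, not measured): for a LLaMA-7B-style model (32 layers, 4096-dim embeddings, fp16 K and V), the KV cache is about 2 × 32 × 4096 × 2 bytes ≈ 512 KiB per token, so a 4096-token prompt carries roughly 2 GiB of state. Reading that from an NVMe drive at ~2 GB/s takes on the order of a second, while re-evaluating 4096 tokens at, say, 50 tokens/s of CPU prompt processing takes over a minute. On a fast GPU that processes thousands of prompt tokens per second the gap narrows considerably, so the win depends on hardware.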
If the idea is sound, any pointers for implementation would be nice. Based on my study so far, saving and loading are straightforward (following the save-load-state example), but the state shared across slots needs to be taken care of. My main concern is continuous batching: I am not sure how this approach affects it.
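To make the batching worry concrete, a sketch of the failure mode and one conservative workaround, assuming the whole-context API from the sketch above (batch_mutex and restore_into_slot are hypothetical):

```cpp
#include <mutex>
#include <string>

#include "llama.h"

bool load_named_cache(llama_context * ctx, const std::string & name); // from the sketch above

// The whole-context state calls snapshot and overwrite the KV cache of
// *every* sequence, so restoring into one slot while other slots are
// mid-generation would clobber their state. A conservative first cut is to
// serialize the restore behind the same lock that guards batch construction
// and only perform it when no other slot has tokens in flight.
static std::mutex batch_mutex; // hypothetical: the server's batching lock

static bool restore_into_slot(llama_context * ctx, const std::string & name) {
    std::lock_guard<std::mutex> lock(batch_mutex);
    // Precondition (not checked here): all other slots are idle.
    return load_named_cache(ctx, name);
}
```

A per-sequence save/restore (one slot's KV cells only) would sidestep this entirely, but I don't know whether the current API exposes that.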