Ok, maybe it's possible to implement this with the help of `llama_kv_cache_seq_cp`. I'm not completely sure I understand it correctly, though.
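A minimal sketch of what that could look like, assuming the `llama_kv_cache_seq_*` C API of this era (these calls exist in llama.cpp, but their names and signatures have changed between versions, and `seq_spare` here is just a hypothetical reserved sequence id):

```c
// Sketch: stash a chunk [p0, p1) of the main sequence under a spare
// sequence id so it can be evicted from the main sequence and kept
// around for later. Assumes the llama_kv_cache_seq_* C API; names and
// signatures differ between llama.cpp versions.
#include "llama.h"

static void stash_chunk(struct llama_context * ctx,
                        llama_seq_id seq_main,   // e.g. 0
                        llama_seq_id seq_spare,  // unused seq id reserved for this chunk
                        llama_pos p0, llama_pos p1) {
    // tag the chunk's KV cells with the spare sequence id as well
    llama_kv_cache_seq_cp(ctx, seq_main, seq_spare, p0, p1);

    // drop the chunk from the main sequence; the cells survive because
    // they are still referenced by seq_spare
    llama_kv_cache_seq_rm(ctx, seq_main, p0, p1);
}
```

As far as I can tell, in the unified KV cache `llama_kv_cache_seq_cp` does not copy any tensor data; it only tags the same cells with the extra sequence id, so a "stash" like this should be cheap.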
Motivation: I want to split the context into chunks and store each chunk separately, both for later reloading and for research purposes. A chunk can be seen as one round of interaction between the user and the assistant, for example. In my project I do a lot of "context shifting" by evicting "middle" chunks, because the total number of tokens greatly exceeds the context size that is practically available with my limited GPU resources (around 8192 tokens). Serializing the whole state each time with the existing API would be extremely inefficient. There are a few possible uses for a precise KV-cache control API, for example:
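One such use, sketched here under assumptions about the C API of this era, is the eviction itself: removing a middle chunk and shifting the tail left so the remaining positions are contiguous again (`llama_kv_cache_seq_rm` and `llama_kv_cache_seq_add` are real llama.cpp calls, but exact names and signatures vary across versions):

```c
// Sketch: evict a "middle" chunk [p0, p1) from a sequence and shift
// the tail back by the chunk's length, the usual "context shifting"
// pattern. Assumes the signatures shown; they differ between versions.
#include "llama.h"

static void evict_middle_chunk(struct llama_context * ctx,
                               llama_seq_id seq,
                               llama_pos p0,    // first position of the chunk
                               llama_pos p1) {  // one past its last position
    // remove the chunk's cells from the cache
    llama_kv_cache_seq_rm(ctx, seq, p0, p1);

    // shift the positions of [p1, end) left by the chunk's length;
    // passing -1 as the end of the range means "to the end"
    llama_kv_cache_seq_add(ctx, seq, p1, -1, p0 - p1);
}
```

In the versions I've looked at, the corresponding K-shift for RoPE models is applied lazily on the next decode, so the call itself is only bookkeeping.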