Ok, maybe it's possible to implement this with the help of `llama_kv_cache_seq_cp`. I'm not completely sure I understand it correctly, though.
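A minimal sketch of what that could look like, assuming the `llama_kv_cache_seq_*` C API of this era (these calls exist in llama.cpp, but their names and signatures have changed between versions, and `seq_spare` here is just a hypothetical reserved sequence id):

```c
// Sketch: stash a chunk [p0, p1) of the main sequence under a spare
// sequence id so it can be evicted from the main sequence and kept
// around for later. Assumes the llama_kv_cache_seq_* C API; names and
// signatures differ between llama.cpp versions.
#include "llama.h"

static void stash_chunk(struct llama_context * ctx,
                        llama_seq_id seq_main,   // e.g. 0
                        llama_seq_id seq_spare,  // unused seq id reserved for this chunk
                        llama_pos p0, llama_pos p1) {
    // tag the chunk's KV cells with the spare sequence id as well
    llama_kv_cache_seq_cp(ctx, seq_main, seq_spare, p0, p1);

    // drop the chunk from the main sequence; the cells survive because
    // they are still referenced by seq_spare
    llama_kv_cache_seq_rm(ctx, seq_main, p0, p1);
}
```

As far as I can tell, in the unified KV cache `llama_kv_cache_seq_cp` does not copy any tensor data; it only tags the same cells with the extra sequence id, so a "stash" like this should be cheap.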
Motivation: I want to split the context into chunks and store each chunk separately, both for later reloading and for research purposes. A chunk can be seen as one round of interaction between the user and the assistant, for example. In my project I do a lot of "context shifting" by evicting "middle" chunks, because the total number of tokens greatly exceeds the context size that is practically available with my limited GPU resources (around 8192 tokens). Serializing the whole state each time with the existing API would be extremely inefficient. There are a few possible uses for a precise KV-cache control API, for example:
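One such use, sketched here under assumptions about the C API of this era, is the eviction itself: removing a middle chunk and shifting the tail left so the remaining positions are contiguous again (`llama_kv_cache_seq_rm` and `llama_kv_cache_seq_add` are real llama.cpp calls, but exact names and signatures vary across versions):

```c
// Sketch: evict a "middle" chunk [p0, p1) from a sequence and shift
// the tail back by the chunk's length, the usual "context shifting"
// pattern. Assumes the signatures shown; they differ between versions.
#include "llama.h"

static void evict_middle_chunk(struct llama_context * ctx,
                               llama_seq_id seq,
                               llama_pos p0,    // first position of the chunk
                               llama_pos p1) {  // one past its last position
    // remove the chunk's cells from the cache
    llama_kv_cache_seq_rm(ctx, seq, p0, p1);

    // shift the positions of [p1, end) left by the chunk's length;
    // passing -1 as the end of the range means "to the end"
    llama_kv_cache_seq_add(ctx, seq, p1, -1, p0 - p1);
}
```

In the versions I've looked at, the corresponding K-shift for RoPE models is applied lazily on the next decode, so the call itself is only bookkeeping.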