Is it reasonable to add C APIs to get & set kv-cache? #6540
Replies: 1 comment
-
Sorry, I didn't notice the newly added APIs shown below. I'm going to close this discussion for now and try to work with them. I'll reopen it if the APIs I described prove to be necessary after all.

    LLAMA_API size_t llama_state_seq_set_data(
            struct llama_context * ctx,
            const uint8_t * src,
            llama_seq_id dest_seq_id);

    LLAMA_API size_t llama_state_seq_save_file(
            struct llama_context * ctx,
            const char * filepath,
            llama_seq_id seq_id,
            const llama_token * tokens,
            size_t n_token_count);

    LLAMA_API size_t llama_state_seq_load_file(
            struct llama_context * ctx,
            const char * filepath,
            llama_seq_id dest_seq_id,
            llama_token * tokens_out,
            size_t n_token_capacity,
            size_t * n_token_count_out);
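For anyone landing here later, this is a rough sketch of how the two file-based calls might be wrapped to park one user's sequence and bring it back under a different sequence id. The helper names `park_session` and `resume_session` are hypothetical, and so that the sketch compiles without llama.cpp, the two `llama_state_seq_*` functions are replaced by toy stand-ins that persist only the token list. The signatures are copied from the declarations above; treating a return value of zero as failure is my assumption, so please check `llama.h` before relying on it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ---- Stand-ins so the sketch compiles without llama.cpp. ----
 * In real use, delete this section and #include "llama.h" instead.
 * The bodies are toy assumptions: they store only the token list,
 * not any kv-cache data. */
typedef int32_t llama_seq_id;
typedef int32_t llama_token;
struct llama_context { int unused; };

size_t llama_state_seq_save_file(struct llama_context *ctx, const char *filepath,
                                 llama_seq_id seq_id, const llama_token *tokens,
                                 size_t n_token_count) {
    (void) ctx; (void) seq_id;
    FILE *f = fopen(filepath, "wb");
    if (!f) return 0;
    int ok = fwrite(&n_token_count, sizeof n_token_count, 1, f) == 1 &&
             fwrite(tokens, sizeof *tokens, n_token_count, f) == n_token_count;
    fclose(f);
    return ok ? sizeof n_token_count + n_token_count * sizeof *tokens : 0;
}

size_t llama_state_seq_load_file(struct llama_context *ctx, const char *filepath,
                                 llama_seq_id dest_seq_id, llama_token *tokens_out,
                                 size_t n_token_capacity, size_t *n_token_count_out) {
    (void) ctx; (void) dest_seq_id;
    FILE *f = fopen(filepath, "rb");
    if (!f) return 0;
    size_t n = 0;
    int ok = fread(&n, sizeof n, 1, f) == 1 && n <= n_token_capacity &&
             fread(tokens_out, sizeof *tokens_out, n, f) == n;
    fclose(f);
    if (!ok) return 0;
    *n_token_count_out = n;
    return sizeof n + n * sizeof *tokens_out;
}
/* ---- End of stand-ins. ---- */

/* Hypothetical helper: save one sequence when its user goes offline.
 * Returns 1 on success, 0 on failure (assumes 0 bytes means failure). */
int park_session(struct llama_context *ctx, const char *path, llama_seq_id seq,
                 const llama_token *tokens, size_t n_tokens) {
    assert(path != NULL);
    return llama_state_seq_save_file(ctx, path, seq, tokens, n_tokens) > 0;
}

/* Hypothetical helper: load a parked sequence into whichever sequence id
 * is free now, leaving every other sequence in the context untouched. */
int resume_session(struct llama_context *ctx, const char *path, llama_seq_id free_seq,
                   llama_token *tokens_out, size_t capacity, size_t *n_tokens_out) {
    assert(path != NULL);
    return llama_state_seq_load_file(ctx, path, free_seq, tokens_out,
                                     capacity, n_tokens_out) > 0;
}
```

The point of the wrappers is only that the destination sequence id does not have to match the one the session was saved from, which is what a server with changing users needs.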
-
Greetings!
I'm opening this discussion to ask whether it's reasonable to add a C API to get and set the kv_cache directly. :)
Description
Currently, the following C APIs relate to the kv-cache (only some of them are listed here):
- llama_model_kv_override allows users to override the key-value pairs of the model metadata.
- llama_kv_cache_seq_add and llama_kv_cache_seq_div allow users to manipulate the kv-cache directly.
- llama_copy_state_data and llama_set_state_data allow users to get and set the state data, including embeddings, logits and kv-cache.
Since there is already an API to get and set the state data, including the kv-cache, and the kv-cache can already be modified from outside, I wonder whether it's possible to add the following three APIs.
It's not a formal proposal, because llama.cpp actually uses cells to store the kv-cache, and padding is required when setting the kv-cache. If it is at least feasible to implement, I'd like to go a step further, refine the proposal, and contribute.
I have searched the issues and discussions, but I didn't find any related topics. I'd appreciate it if you could let me know about any similar discussion that I missed.
Application case
As for the application of this API, consider a server that uses batched decoding. At some moment, three users, a, b and c, are online, so their queries are put into one batch to generate the responses.
Later, a and b log in again after being offline for some time. At that moment, d and e are also online. a and b want to continue their previous sessions, but no cache is found for them, while d and e already have some kv-cache in memory. In this case, saving the kv-cache is a good option for getting the best performance. The state data from llama_copy_state_data covers all the sequences, including a's, b's and c's. However, loading that state will remove the state of d and e and, what's more, will load an unwanted state from c. Thus, it sounds reasonable to add an API to get and set the kv-cache directly.
Conclusion
In conclusion, my question is: is it reasonable to add a C API to get and set the kv_cache directly?
Any suggestions would be appreciated!
Best regards,
Rinne
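P.S. The scenario from the application case can be modelled with a small toy sketch. Nothing here is llama.cpp code: `toy_ctx` and every `toy_*` function are hypothetical, and the model assumes one fixed-size slot per sequence, which ignores the cell layout and padding mentioned earlier. It only illustrates why restoring a whole-context snapshot clobbers d and e and brings back c, while a per-sequence restore does not.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_SEQ   8   /* hypothetical number of sequence slots */
#define TOY_SEQ_BYTES 16  /* hypothetical size of one sequence's cache */

/* Toy stand-in for a context: one fixed-size slot per sequence. */
typedef struct {
    char slot[TOY_MAX_SEQ][TOY_SEQ_BYTES];
} toy_ctx;

/* Pretend that decoding filled the cache of sequence `seq`. */
void toy_fill(toy_ctx *ctx, int seq, const char *text) {
    assert(seq >= 0 && seq < TOY_MAX_SEQ);
    snprintf(ctx->slot[seq], TOY_SEQ_BYTES, "%s", text);
}

/* Whole-context snapshot and restore: every sequence at once. */
void toy_copy_state(const toy_ctx *ctx, toy_ctx *dst) { *dst = *ctx; }
void toy_set_state(toy_ctx *ctx, const toy_ctx *src)  { *ctx = *src; }

/* Per-sequence get and set: only the requested slot is touched. */
void toy_seq_get(const toy_ctx *ctx, int seq, char *dst) {
    assert(seq >= 0 && seq < TOY_MAX_SEQ);
    memcpy(dst, ctx->slot[seq], TOY_SEQ_BYTES);
}
void toy_seq_set(toy_ctx *ctx, int seq, const char *src) {
    assert(seq >= 0 && seq < TOY_MAX_SEQ);
    memcpy(ctx->slot[seq], src, TOY_SEQ_BYTES);
}

/* Replays the scenario and leaves the final state in *ctx.
 * per_sequence == 0: restore a and b from the whole-context snapshot.
 * per_sequence != 0: restore a and b one sequence at a time. */
void toy_run_scenario(toy_ctx *ctx, int per_sequence) {
    toy_ctx whole;
    char a_saved[TOY_SEQ_BYTES], b_saved[TOY_SEQ_BYTES];

    /* a, b and c are online in sequences 0, 1 and 2. */
    memset(ctx, 0, sizeof *ctx);
    toy_fill(ctx, 0, "a-cache");
    toy_fill(ctx, 1, "b-cache");
    toy_fill(ctx, 2, "c-cache");

    /* Save before they go offline, in both ways. */
    toy_copy_state(ctx, &whole);
    toy_seq_get(ctx, 0, a_saved);
    toy_seq_get(ctx, 1, b_saved);

    /* a, b and c leave; d and e arrive in sequences 3 and 4. */
    memset(ctx, 0, sizeof *ctx);
    toy_fill(ctx, 3, "d-cache");
    toy_fill(ctx, 4, "e-cache");

    /* a and b come back. */
    if (per_sequence) {
        toy_seq_set(ctx, 0, a_saved);
        toy_seq_set(ctx, 1, b_saved);
    } else {
        toy_set_state(ctx, &whole);
    }
}
```

Calling `toy_run_scenario(&ctx, 0)` leaves slots 3 and 4 empty and slot 2 holding c's data, whereas `toy_run_scenario(&ctx, 1)` keeps d and e intact and leaves slot 2 empty.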