
Can someone help me understand the KV cache? #9677

Answered by ggerganov
walker-ai asked this question in Q&A

  1. The n_kv indeed grows gradually, but in chunks of 32 or 256 (determined by llama_kv_cache_get_padding()):

https://github.com/ggerganov/llama.cpp/blob/589b48d41efb0e95133b77c335f4fb9779af9bfb/src/llama.cpp#L17186-L17187

Padded values are masked during the attention calculation (a small sketch of the padding arithmetic follows this list).

  2. The offset is always 0 since we view the KV cache buffers from their beginning up to n_kv.
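
To illustrate point 1, here is a minimal C++ sketch of the padding arithmetic. The pad_up helper and the concrete padding value are assumptions for illustration, not the actual llama.cpp code; in the real code the value comes from llama_kv_cache_get_padding().

```cpp
// Minimal sketch, not llama.cpp code: round the number of occupied KV cells up to
// a padding multiple, which is why n_kv grows in chunks of 32 or 256.
#include <cstdint>
#include <cstdio>

// Round x up to the nearest multiple of pad (assumes pad > 0).
static uint32_t pad_up(uint32_t x, uint32_t pad) {
    return (x + pad - 1) / pad * pad;
}

int main() {
    const uint32_t padding = 256; // illustrative; llama_kv_cache_get_padding() decides the real value
    const uint32_t n_used  = 70;  // KV cells actually occupied so far

    const uint32_t n_kv = pad_up(n_used, padding); // 70 -> 256

    printf("occupied = %u, n_kv = %u\n", n_used, n_kv);
    return 0;
}
```
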

Technically, n_kv could be constant and equal to the maximum KV cache size. But this would make the inference sub-optimal, because we would be attending to many unused KV cells, which would increase the computation significantly for no reason. This is why we "truncate" the KV cache from the end:

https://github.com/ggerganov…
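
To make the "offset is always 0" and "padded values are masked" points concrete, here is a simplified C++ sketch under my own assumptions (a single 1-D mask row; the real code builds a 2-D KQ mask tensor): we only ever look at the first n_kv entries of the K/V buffers, and the unused tail of that view is masked with -infinity so softmax gives it zero weight.

```cpp
// Sketch only (simplified from the real attention path): view the K/V buffers from
// element 0 up to n_kv, and mask the cells that are padded/unused.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_kv   = 256; // padded view size (see the previous sketch)
    const int n_used = 70;  // cells that actually hold cached tokens

    // Mask for a single query row: 0 for usable cells, -inf for padded ones.
    std::vector<float> kq_mask(n_kv, 0.0f);
    for (int i = n_used; i < n_kv; ++i) {
        kq_mask[i] = -INFINITY; // exp(-inf) = 0, so these cells get zero attention weight
    }

    // The view into the backing K/V buffer always starts at element 0 and spans n_kv
    // cells; shrinking n_kv "truncates" the cache from the end and saves computation.
    printf("view = [0, %d), masked cells = %d\n", n_kv, n_kv - n_used);
    return 0;
}
```
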

Answer selected by walker-ai