With standard multi-head attention, each Query head has its own corresponding Key and Value head, so nothing is shared. With grouped-query attention (GQA), several Query heads share the same Key/Value head. As a consequence the KV cache is smaller, since we store only the data from the unique KV heads.
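To make that concrete, here is a minimal sketch (plain C, not llama.cpp code; the names are made up) of how the number of cached elements depends on the number of KV heads:

```c
#include <stdio.h>

/* Elements held in the KV cache:
 * with standard MHA, n_head_kv == n_head, so nothing is saved;
 * with GQA, n_head_kv < n_head, and the cache shrinks proportionally. */
static size_t kv_cache_elements(size_t n_layer, size_t n_ctx, size_t n_head_kv, size_t n_embd_head) {
    return n_layer * n_ctx * n_head_kv * n_embd_head * 2; /* x2 for K and V */
}

int main(void) {
    /* LLaMA-7B geometry: 32 layers, 32 heads, head dim 128 -> plain MHA, no sharing */
    printf("MHA : %zu elements\n", kv_cache_elements(32, 512, 32, 128)); /* 134217728 */
    /* hypothetical GQA variant of the same geometry with 8 KV heads */
    printf("GQA8: %zu elements\n", kv_cache_elements(32, 512, 8, 128));  /*  33554432 */
    return 0;
}
```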
The V data in the KV cache is stored transposed. If it were stored the same way as K, the final multiplication would have to be something like `struct ggml_tensor * KQV = ggml_mul_mat(ctx0, ggml_transpose(V), KQ_soft_max);`, which is slower than the current: https://github.com/ggerganov/llama.cpp/blob/f72f8f22c9cb60465b2e79df2767e4ba9604e576/llama.cpp#L2847
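For intuition, a rough plain-C illustration (not ggml code, hypothetical names) of the access patterns behind that: the attention output is a reduction over the context positions, so keeping positions contiguous per channel avoids strided reads.

```c
#include <stddef.h>

/* Attention output for one head: out[d] = sum_t w[t] * V[t][d].
 * Illustrative only; 'w', 'v', 'vt' are made-up names. */

/* V stored "as produced" (position-major): v[t*head_dim + d].
 * The inner loop over t strides by head_dim -> poor memory behaviour. */
void attn_out_row_major(const float *v, const float *w, float *out, int n_tok, int head_dim) {
    for (int d = 0; d < head_dim; d++) {
        float acc = 0.0f;
        for (int t = 0; t < n_tok; t++) acc += w[t] * v[(size_t)t*head_dim + d];
        out[d] = acc;
    }
}

/* V stored transposed (channel-major): vt[d*n_tok + t].
 * The inner loop over t is contiguous -> this is what the cache layout enables. */
void attn_out_transposed(const float *vt, const float *w, float *out, int n_tok, int head_dim) {
    for (int d = 0; d < head_dim; d++) {
        float acc = 0.0f;
        const float *row = vt + (size_t)d*n_tok;
        for (int t = 0; t < n_tok; t++) acc += w[t] * row[t];
        out[d] = acc;
    }
}
```

As far as I understand, ggml_mul_mat takes dot products along the rows of its operands, so the transposed V layout is the one it prefers.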
I am tracking the inference procedure of the Llama-7B model. I found the following facts:
With 512 context length, kv cache size = 256MB, which can be calculated by:
$$n_{mem} = {n_{layer}(32)}\times{n_{ctx}(512)}=16384$$
$$n_{elements} = n_{mem}\times n_{embd}(4096) = 67108864$$
$$67108864 \times 2\,(\text{K and V}) \times 2\,(\text{bytes for fp16}) = 256\text{MB}$$
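The same arithmetic as a tiny C snippet (illustrative only):

```c
#include <stdio.h>

int main(void) {
    const size_t n_layer = 32, n_ctx = 512, n_embd = 4096;
    const size_t n_mem      = n_layer * n_ctx;              /* 16384 */
    const size_t n_elements = n_mem * n_embd;               /* 67108864 per tensor (K or V) */
    const size_t bytes      = n_elements * 2 /* K and V */ * 2 /* fp16 */;
    printf("%zu bytes = %zu MB\n", bytes, bytes / (1024*1024)); /* 268435456 bytes = 256 MB */
    return 0;
}
```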
According to
https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L2784-L2813
Each layer has a Kcur [N, n_heads(32), n_embd_head(128)] and a Vcur [N, n_heads*n_embd_head(4096)].
In my understanding, Llama leverages GQA to compress the size of the kv_cache. But both facts suggest that every head has its own specialized kv_cache which is not shared with other heads. Have I misunderstood something?
Furthermore, I noticed that Kcur and Vcur have distinct shapes. I wonder why only Kcur is split into N*32*128, while Vcur is recorded as 4096*N.
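As far as I can tell the two shapes describe the same numbers and only the view differs; a plain-C index sketch of that equivalence (hypothetical names, not ggml code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat buffer holding Kcur for N tokens, n_embd = 32*128 = 4096. */
static float k_3d(const float *k, size_t n, size_t h, size_t d) {
    return k[(n*32 + h)*128 + d];      /* [N, n_head(32), n_embd_head(128)] view */
}
static float k_2d(const float *k, size_t n, size_t e) {
    return k[n*4096 + e];              /* [N, n_embd(4096)] view, with e = h*128 + d */
}

int main(void) {
    static float k[2*4096];            /* N = 2 tokens, just to exercise both views */
    for (size_t i = 0; i < 2*4096; i++) k[i] = (float)i;
    /* Same underlying element, addressed through either view. */
    assert(k_3d(k, 1, 5, 7) == k_2d(k, 1, 5*128 + 7));
    return 0;
}
```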
For https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L2815-L2823
I also noticed that k and v are handled differently. I suspect this might be related to the underlying implementation of tensor storage? I am not familiar with ggml, so this part confuses me. I would really appreciate an answer!
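If I read the referenced code correctly, the two copy patterns look roughly like this (plain C sketch of my understanding, not ggml code; the names are made up):

```c
#include <string.h>
#include <stddef.h>

/* K cache: per layer, tokens are appended position-major, so the N new tokens
 *          land in one contiguous block of N*n_embd elements.
 * V cache: per layer, data is kept channel-major ([n_embd, n_ctx]); each of the
 *          n_embd channels receives N new values at column offset n_past,
 *          i.e. many small strided writes instead of one contiguous one. */
void store_k(float *k_cache, const float *Kcur, size_t n_embd, size_t n_past, size_t N) {
    memcpy(k_cache + n_past*n_embd, Kcur, N*n_embd*sizeof(float));
}
void store_v(float *v_cache, const float *Vcur, size_t n_embd, size_t n_ctx, size_t n_past, size_t N) {
    for (size_t e = 0; e < n_embd; e++)          /* channel */
        for (size_t t = 0; t < N; t++)           /* new token */
            v_cache[e*n_ctx + n_past + t] = Vcur[t*n_embd + e];
}
```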