With standard multi-head attention, each Query head has its own corresponding Key and Value head, so nothing is shared. With grouped-query attention (GQA), several Query heads share the same Key/Value head. As a consequence the KV cache is smaller, since we store only the data from the unique KV heads.
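To make that concrete, here is a minimal sketch (plain C, not llama.cpp code; the names are made up) of how the number of cached elements depends on the number of KV heads:

```c
#include <stdio.h>

/* Elements held in the KV cache:
 * with standard MHA, n_head_kv == n_head, so nothing is saved;
 * with GQA, n_head_kv < n_head, and the cache shrinks proportionally. */
static size_t kv_cache_elements(size_t n_layer, size_t n_ctx, size_t n_head_kv, size_t n_embd_head) {
    return n_layer * n_ctx * n_head_kv * n_embd_head * 2; /* x2 for K and V */
}

int main(void) {
    /* LLaMA-7B geometry: 32 layers, 32 heads, head dim 128 -> plain MHA, no sharing */
    printf("MHA : %zu elements\n", kv_cache_elements(32, 512, 32, 128)); /* 134217728 */
    /* hypothetical GQA variant of the same geometry with 8 KV heads */
    printf("GQA8: %zu elements\n", kv_cache_elements(32, 512, 8, 128));  /*  33554432 */
    return 0;
}
```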
The V data in the KV cache is stored transposed. If it were stored the same way as K, the final multiplication would have to be something like `struct ggml_tensor * KQV = ggml_mul_mat(ctx0, ggml_transpose(V), KQ_soft_max);`, which is slower than the current: https://github.com/ggerganov/llama.cpp/blob/f72f8f22c9cb60465b2e79df2767e4ba9604e576/llama.cpp#L2847
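For intuition, a rough plain-C illustration (not ggml code, hypothetical names) of the access patterns behind that: the attention output is a reduction over the context positions, so keeping positions contiguous per channel avoids strided reads.

```c
#include <stddef.h>

/* Attention output for one head: out[d] = sum_t w[t] * V[t][d].
 * Illustrative only; 'w', 'v', 'vt' are made-up names. */

/* V stored "as produced" (position-major): v[t*head_dim + d].
 * The inner loop over t strides by head_dim -> poor memory behaviour. */
void attn_out_row_major(const float *v, const float *w, float *out, int n_tok, int head_dim) {
    for (int d = 0; d < head_dim; d++) {
        float acc = 0.0f;
        for (int t = 0; t < n_tok; t++) acc += w[t] * v[(size_t)t*head_dim + d];
        out[d] = acc;
    }
}

/* V stored transposed (channel-major): vt[d*n_tok + t].
 * The inner loop over t is contiguous -> this is what the cache layout enables. */
void attn_out_transposed(const float *vt, const float *w, float *out, int n_tok, int head_dim) {
    for (int d = 0; d < head_dim; d++) {
        float acc = 0.0f;
        const float *row = vt + (size_t)d*n_tok;
        for (int t = 0; t < n_tok; t++) acc += w[t] * row[t];
        out[d] = acc;
    }
}
```

As far as I understand, ggml_mul_mat takes dot products along the rows of its operands, so the transposed V layout is the one it prefers.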
I am tracking the inference procedure of the Llama-7B model. I found the following facts:
With 512 context length, kv cache size = 256MB, which can be calculated by:
$$n_{mem} = {n_{layer}(32)}\times{n_{ctx}(512)}=16384$$
$$n_{elements} = n_{mem}\times n_{embd}(4096) = 67108864$$
$$67108864 \times 2\,(\text{K and V}) \times 2\,(\text{bytes for fp16}) = 256\text{MB}$$
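The same arithmetic as a tiny C snippet (illustrative only):

```c
#include <stdio.h>

int main(void) {
    const size_t n_layer = 32, n_ctx = 512, n_embd = 4096;
    const size_t n_mem      = n_layer * n_ctx;              /* 16384 */
    const size_t n_elements = n_mem * n_embd;               /* 67108864 per tensor (K or V) */
    const size_t bytes      = n_elements * 2 /* K and V */ * 2 /* fp16 */;
    printf("%zu bytes = %zu MB\n", bytes, bytes / (1024*1024)); /* 268435456 bytes = 256 MB */
    return 0;
}
```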
According to
https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L2784-L2813
Each layer has a Kcur [N, n_heads(32), n_embd_head(128)] and a Vcur [N, n_heads*n_embd_head(4096)].
In my understanding, Llama leverages GQA to compress the size of the kv_cache. But both facts suggest that every head has its own specialized kv_cache which is not shared with other heads. Have I misunderstood something?
Furthermore, I noticed that Kcur and Vcur have distinct shapes. I wonder why only Kcur is split into N*32*128, while Vcur is recorded as 4096*N.
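As far as I can tell the two shapes describe the same numbers and only the view differs; a plain-C index sketch of that equivalence (hypothetical names, not ggml code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat buffer holding Kcur for N tokens, n_embd = 32*128 = 4096. */
static float k_3d(const float *k, size_t n, size_t h, size_t d) {
    return k[(n*32 + h)*128 + d];      /* [N, n_head(32), n_embd_head(128)] view */
}
static float k_2d(const float *k, size_t n, size_t e) {
    return k[n*4096 + e];              /* [N, n_embd(4096)] view, with e = h*128 + d */
}

int main(void) {
    static float k[2*4096];            /* N = 2 tokens, just to exercise both views */
    for (size_t i = 0; i < 2*4096; i++) k[i] = (float)i;
    /* Same underlying element, addressed through either view. */
    assert(k_3d(k, 1, 5, 7) == k_2d(k, 1, 5*128 + 7));
    return 0;
}
```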
For https://github.com/ggerganov/llama.cpp/blob/019ba1dcd0c7775a5ac0f7442634a330eb0173cc/llama.cpp#L2815-L2823
I also noticed that k and v are handled differently. I suspect this might be related to the underlying implementation of tensor storage? I am not familiar with ggml, so this part confuses me. I would really appreciate an answer!
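If I read the referenced code correctly, the two copy patterns look roughly like this (plain C sketch of my understanding, not ggml code; the names are made up):

```c
#include <string.h>
#include <stddef.h>

/* K cache: per layer, tokens are appended position-major, so the N new tokens
 *          land in one contiguous block of N*n_embd elements.
 * V cache: per layer, data is kept channel-major ([n_embd, n_ctx]); each of the
 *          n_embd channels receives N new values at column offset n_past,
 *          i.e. many small strided writes instead of one contiguous one. */
void store_k(float *k_cache, const float *Kcur, size_t n_embd, size_t n_past, size_t N) {
    memcpy(k_cache + n_past*n_embd, Kcur, N*n_embd*sizeof(float));
}
void store_v(float *v_cache, const float *Vcur, size_t n_embd, size_t n_ctx, size_t n_past, size_t N) {
    for (size_t e = 0; e < n_embd; e++)          /* channel */
        for (size_t t = 0; t < N; t++)           /* new token */
            v_cache[e*n_ctx + n_past + t] = Vcur[t*n_embd + e];
}
```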