-
In the ...
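// split cached k into n_head_kv heads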
struct ggml_tensor * k =
ggml_view_3d(ctx, kv.k_l[il],
n_embd_head_k, n_kv, n_head_kv,
ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
0);
cb(k, "k", il);
// split cached v into n_head heads
struct ggml_tensor * v =
ggml_view_3d(ctx, kv.v_l[il],
n_kv, n_embd_head_v, n_head_kv,
ggml_element_size(kv.v_l[il])*n_ctx,
ggml_element_size(kv.v_l[il])*n_ctx*n_embd_head_v,
0);
cb(v, "v", il);
struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);
cb(kqv, "kqv", il);
struct ggml_tensor * kqv_merged = ggml_permute(ctx, kqv, 0, 2, 1, 3);
cb(kqv_merged, "kqv_merged", il);
... There are two parameters that I am concerned about:
Can anyone help with the above two questions? Thanks a lot.
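(For reference, a minimal standalone sketch of the index arithmetic the two views above express; this is not taken from llama.cpp, the dimension values are made up, and it assumes an unquantized cache so that ggml_row_size(type, n) is simply n * element_size. The K cache keeps each cell's n_embd_k_gqa values contiguous, while the V cache is stored transposed, with a stride of n_ctx elements between embedding dimensions.)

/*
 * Sketch of where element (dim, cell, head) of the K view and
 * element (cell, dim, head) of the V view land in the flat cache buffers.
 */
#include <stdio.h>

int main(void) {
    // hypothetical small dimensions, for illustration only
    const int n_embd_head_k = 4;   // head dimension for K
    const int n_embd_head_v = 4;   // head dimension for V
    const int n_head_kv     = 2;   // number of KV heads
    const int n_embd_k_gqa  = n_embd_head_k * n_head_kv;
    const int n_ctx         = 8;   // cache size in cells

    const int cell = 3;            // which cache cell (token slot)
    const int head = 1;            // which KV head
    const int dim  = 2;            // which element within the head

    // K view: ne = (n_embd_head_k, n_kv, n_head_kv),
    //         nb1 = n_embd_k_gqa elements, nb2 = n_embd_head_k elements
    const int k_idx = cell * n_embd_k_gqa + head * n_embd_head_k + dim;

    // V view: ne = (n_kv, n_embd_head_v, n_head_kv),
    //         nb1 = n_ctx elements, nb2 = n_ctx * n_embd_head_v elements
    const int v_idx = head * n_embd_head_v * n_ctx + dim * n_ctx + cell;

    printf("K[cell=%d, head=%d, dim=%d] -> flat index %d\n", cell, head, dim, k_idx);
    printf("V[cell=%d, head=%d, dim=%d] -> flat index %d\n", cell, head, dim, v_idx);
    return 0;
}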
-
n_kv indeed grows gradually, but in chunks of 32 or 256 (determined by llama_kv_cache_get_padding()): https://github.com/ggerganov/llama.cpp/blob/589b48d41efb0e95133b77c335f4fb9779af9bfb/src/llama.cpp#L17186-L17187

Padded values are masked during the attention calculation. … n_kv.

Technically, n_kv could be constant and equal to the maximum KV cache size. But this would make the inference sub-optimal, because we would be attending to too many unused KV cells, which would increase the computations significantly for no reason. This is why we "truncate" the KV cache from the end: https://github.com/ggerganov…
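To make that concrete, here is a rough standalone sketch (mine, not the actual llama.cpp implementation; the constants, the helper pad_to(), and the single-row mask are simplifications) of the two points above: n_kv is the number of used cells rounded up to a multiple of the padding and clamped to the cache size, and the padded/unused cells receive -INFINITY in the KQ mask, so softmax assigns them zero weight.

#include <math.h>
#include <stdio.h>

// round x up to a multiple of pad (same idea as GGML_PAD)
static int pad_to(int x, int pad) {
    return ((x + pad - 1) / pad) * pad;
}

int main(void) {
    const int kv_size  = 4096;  // total KV cache size (cells)
    const int padding  = 32;    // 32 without flash attention, 256 with it
    const int cell_max = 70;    // highest used cell + 1 (illustrative value)

    // number of cells actually attended to in this graph
    int n_kv = pad_to(cell_max, padding);   // 70 -> 96
    if (n_kv > kv_size) n_kv = kv_size;

    printf("attending to n_kv = %d of %d cells\n", n_kv, kv_size);

    // one row of the KQ mask for a single query token:
    // 0.0f for cells it may attend to, -INFINITY for padded/unused cells
    float mask[4096];
    for (int i = 0; i < n_kv; ++i) {
        mask[i] = (i < cell_max) ? 0.0f : -INFINITY;
    }

    // after softmax(KQ + mask), the -INFINITY entries contribute exactly 0
    printf("mask[%d] = %.1f, mask[%d] = %.1f\n",
           cell_max - 1, mask[cell_max - 1], cell_max, mask[cell_max]);
    return 0;
}

The padded cells still take part in the KQ and KQV matrix multiplications, which is exactly the extra work the reply refers to; the masking only guarantees they cannot change the attention output.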