-
I found that it computes RoPE with ne[2] = 32768 every time, which adds computation overhead. I modified it into the following and it works fine now:

```cpp
static struct ggml_tensor * llm_build_kqv(
         struct ggml_context * ctx,
           const llama_model & model,
         const llama_hparams & hparams,
         const llama_cparams & cparams,
        const llama_kv_cache & kv,
          struct ggml_cgraph * graph,
          struct ggml_tensor * wo,
          struct ggml_tensor * wo_b,
          struct ggml_tensor * q_cur,
          struct ggml_tensor * kq_mask,
                     int32_t   n_tokens,
                     int32_t   n_kv,
                       float   kq_scale,
          const llm_build_cb & cb,
                         int   il) {
    const int64_t n_ctx         = cparams.n_ctx;
    const int64_t n_head        = hparams.n_head;
    const int64_t n_head_kv     = hparams.n_head_kv;
    const int64_t n_embd_head_k = hparams.n_embd_head_k;
    const int64_t n_embd_k_gqa  = hparams.n_embd_k_gqa();
    const int64_t n_embd_head_v = hparams.n_embd_head_v;
    const int64_t n_embd_v_gqa  = hparams.n_embd_v_gqa();

    struct ggml_tensor * q = ggml_permute(ctx, q_cur, 0, 2, 1, 3);
    cb(q, "q", il);

    struct ggml_tensor * k;
    if (cparams.pre_rope_cache && (kv.type_k == GGML_TYPE_F32 || kv.type_k == GGML_TYPE_F16)) {
        // view only the n_kv cache entries in use, shaped [n_embd_head_k, n_head_kv, n_kv]
        // so the position index runs along dim 2 (previously ne[2] was the full cache size)
        k = ggml_view_3d(ctx, kv.k_l[il],
                n_embd_head_k, n_head_kv, n_kv,
                ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
                ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
                0);

        // apply RoPE to the stored pre-rope keys on the fly
        k = ggml_rope_ext(
                ctx, k, nullptr, nullptr,
                hparams.n_rot, hparams.rope_type, 0, cparams.n_yarn_orig_ctx,
                cparams.rope_freq_base, cparams.rope_freq_scale,
                cparams.yarn_ext_factor, cparams.yarn_attn_factor,
                cparams.yarn_beta_fast, cparams.yarn_beta_slow);

        // k->ne[2] = kv.size;

        // re-view the roped keys as [n_embd_head_k, n_kv, n_head_kv] to match the layout
        // produced by the standard path below
        k = ggml_view_3d(ctx, k,
                n_embd_head_k, n_kv, n_head_kv,
                ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
                ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
                0);
    } else {
        // keys were already roped before being stored: use the usual cache view
        k = ggml_view_3d(ctx, kv.k_l[il],
                n_embd_head_k, n_kv, n_head_kv,
                ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
                ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
                0);
    }
    cb(k, "k", il);

    // skip below
}
```

I don't know why it combines
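(Note: `pre_rope_cache` is not an upstream `llama_cparams` field; the snippet above assumes it was added as a plain flag next to the other context parameters, roughly like this:)

```cpp
// Hypothetical addition (not in upstream llama.cpp): flag marking that the K cache
// stores pre-RoPE keys, so RoPE has to be applied at attention time in llm_build_kqv.
struct llama_cparams {
    // ... existing fields (n_ctx, rope_freq_base, yarn_*, ...) ...
    bool pre_rope_cache = false;
};
```

With the original view the rope node saw ne[2] = 32768 (the whole cache) on every decode step. For example, with n_head_kv = 8 and n_embd_head_k = 128 (llama3-8B) that is roughly 32768 × 8 × 128 ≈ 33.5M elements roped per layer per token even when only n_kv slots are in use, which is what the n_kv-sized view avoids.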
-
Hi~ I'm just trying to store a pre-RoPE key cache in order to make quantization better. So I skip the `ggml_rope_ext` call before the keys are stored, roughly as in the sketch below, and then modify a few things inside `llm_build_kqv` (model: llama3). I also modified the rope GGML_OP so that it re-computes RoPE for all the KV-cache entries in use, and I have checked that this works.
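(A minimal sketch of that store-side change, assuming a `cparams.pre_rope_cache` flag simply guards the existing `ggml_rope_ext` call on `Kcur` before the keys are written into the cache; variable names mirror the build-graph code rather than the exact upstream lines:)

```cpp
// Hypothetical sketch: skip RoPE on the keys that get written into the K cache.
// Kcur, inp_pos, ctx0, cb, il and the hparams/cparams fields are the usual
// build-graph names; pre_rope_cache is the assumed new flag.
struct ggml_tensor * Kcur_store = ggml_reshape_3d(ctx0, Kcur, n_embd_head_k, n_head_kv, n_tokens);
if (!cparams.pre_rope_cache) {
    // default path: rotate the keys before they are stored
    Kcur_store = ggml_rope_ext(
            ctx0, Kcur_store, inp_pos, nullptr,
            hparams.n_rot, hparams.rope_type, 0, cparams.n_yarn_orig_ctx,
            cparams.rope_freq_base, cparams.rope_freq_scale,
            cparams.yarn_ext_factor, cparams.yarn_attn_factor,
            cparams.yarn_beta_fast, cparams.yarn_beta_slow);
}
cb(Kcur_store, "Kcur", il);
// Kcur_store then flows into the KV-store step, so the cache holds raw (pre-RoPE) keys.
```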
Unfortunately, when I try a large context length (e.g. 32768) the performance declines dramatically (from 11 tokens/s to 1 token/s), and that is measured on the first few generated tokens.

The change is in `ggml_compute_forward_rope_f32` (see the sketch at the end of this comment): I only make it take the array index as the position when computing RoPE for the key cache (`inp_pos` is passed as `nullptr`). The bottleneck is `ggml_rope_ext`, and if the context length is set smaller (e.g. 512) the performance does not decline nearly as much (by context length I mean the value set with `-c`). Re-computing RoPE for each cached token should slow things down somewhat, but it is already really slow while generating just the first few tokens, and I would expect those first tokens not to be impacted much since they only add a little extra computation initially.
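(The body of `ggml_compute_forward_rope_f32` is omitted here; a minimal sketch of the described change, assuming the upstream loop structure where `i2` walks the position dimension, is:)

```cpp
// Hypothetical sketch of the change inside ggml_compute_forward_rope_f32:
// when no positions tensor (inp_pos / src1) is given, fall back to the row index i2,
// so the key stored in slot i2 is rotated by the angle for position i2.
const int32_t * pos = src1 ? (const int32_t *) src1->data : nullptr;

for (int64_t i3 = 0; i3 < ne3; i3++) {
    for (int64_t i2 = 0; i2 < ne2; i2++) {
        const int64_t p = pos ? pos[i2] : i2;   // position-from-index fallback
        // ... existing per-head / per-dimension rotation using p ...
    }
}
```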