Precise decoding progress reporting #8051

vladfaust · 2024-06-21T08:32:29Z

A llama_decode call may take a considerable amount of time, and I'd like to report decoding progress in the UI.

What I'm doing now:

1. Set an eval callback with ggml_backend_sched_set_eval_callback.
2. Call llama_decode.
3. On each callback with ask = false, increment a cur_token counter. I naively assumed that the number of such calls would equal the decoded prompt's token count.
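
For readers who want to see the counting approach concretely, here is a minimal sketch. It assumes the callback signature of ggml_backend_sched_eval_callback, `bool (*)(struct ggml_tensor * t, bool ask, void * user_data)`, and forward-declares ggml_tensor so it compiles without the ggml headers. The names decode_progress and count_eval are hypothetical, not part of llama.cpp.

```cpp
#include <cstdint>

// Opaque stand-in so the sketch compiles without the ggml headers.
struct ggml_tensor;

// Hypothetical state that a UI could read to show progress.
struct decode_progress {
    int64_t observed = 0; // number of callback invocations with ask == false
};

// Same parameter list as ggml_backend_sched_eval_callback.
static bool count_eval(ggml_tensor * /*t*/, bool ask, void * user_data) {
    auto * p = static_cast<decode_progress *>(user_data);
    if (ask) {
        // ask == true: the scheduler asks whether we want to observe this node.
        return true;
    }
    // ask == false: the node has been computed and is passed to us.
    p->observed += 1;
    // Returning false here would cancel the graph compute, so return true.
    return true;
}
```

In a real program the function and a pointer to the state would be registered before calling llama_decode, for example through ggml_backend_sched_set_eval_callback as described above.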

However, the ratio of calls to prompt length comes out anywhere from 2 to 3, presumably depending on the KV cache or prompt size, although it seems consistent under the same circumstances. I tried reading the ggml_backend_sched_compute_splits implementation in ggml-backend.c and found multiple callback_eval calls. Unfortunately, I don't know the code well enough to determine the exact relation, which leaves me with this imperfect approximation.

It'd be great if there was a way to report the decoding progress precisely. What's the math behind these calculations? Or maybe there is a way to filter the callback using the t argument somehow?
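
One way the idea of filtering on the t argument could look is to match on the tensor's name, since ggml_tensor carries a name field. The sketch below is only an illustration: tensor_stub stands in for the real struct, name_filter and count_matching are hypothetical, and the prefix "Kcur" used in the usage note is an assumption about how the graph's tensors are named. Which names actually appear depends on the model graph and would need to be checked by logging them first.

```cpp
#include <cstdint>
#include <cstring>

// Minimal stand-in for the one ggml_tensor field used here; the real struct
// in ggml.h has many more members.
struct tensor_stub {
    char name[64];
};

// Hypothetical filter state: count only tensors whose name starts with `prefix`.
struct name_filter {
    const char * prefix;
    int64_t      matched = 0;
};

static bool starts_with(const char * s, const char * prefix) {
    return std::strncmp(s, prefix, std::strlen(prefix)) == 0;
}

// Same shape as the eval callback, but typed against the stub.
static bool count_matching(tensor_stub * t, bool ask, void * user_data) {
    auto * f = static_cast<name_filter *>(user_data);
    const bool wanted = starts_with(t->name, f->prefix);
    if (ask) {
        return wanted; // only ask to observe tensors of interest
    }
    if (wanted) {
        f->matched += 1;
    }
    return true; // never cancel the computation
}
```

With the prefix set to "Kcur", a tensor named "Kcur-0" would be counted and one named "ffn_out-0" would be skipped. Whether the resulting count tracks the token count is something to verify empirically.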

Any help is appreciated!

Answered by vladfaust

vladfaust · 2024-06-24T06:59:52Z

FYI, when I changed my logic to always return false from the callback after reading #6576, it fires exactly 2 * token count times (keys + values, I assume?), given that the logits are only calculated for the latest token in the batch. LGTM.
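
Taking the observation of 2 * token count callback invocations at face value, the progress fraction for the UI could be computed roughly as follows. This is a sketch, not part of llama.cpp: decode_fraction and its calls_per_token parameter are hypothetical names, and the factor of 2 is the empirical value reported above, which may differ for other models or settings.

```cpp
#include <algorithm>
#include <cstdint>

// Fraction of the batch decoded so far, assuming `calls_per_token` callback
// invocations per token (2 in the observation above). Clamped to [0, 1].
static double decode_fraction(int64_t calls, int64_t n_tokens,
                              double calls_per_token = 2.0) {
    if (n_tokens <= 0 || calls_per_token <= 0.0) {
        return 0.0; // nothing to decode, or an invalid ratio
    }
    const double expected = calls_per_token * static_cast<double>(n_tokens);
    const double fraction = static_cast<double>(calls) / expected;
    return std::min(1.0, std::max(0.0, fraction));
}
```

Clamping keeps the progress bar within bounds if the real ratio turns out slightly higher than assumed.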
