What I'm doing now:
However, the ratio of callback calls to prompt length comes out anywhere from 2 to 3, presumably depending on the KV cache/prompt size, yet the ratio seems to be consistent under the same circumstances.

I tried reading

It'd be great if there was a way to report the decoding progress precisely. What's the math behind these calculations?

Or maybe there is a way to filter the callback using the

Any help is appreciated!
Replies: 1 comment
FYI, when I changed my logic to always return `false` from the callback after reading #6576, it fires exactly `2 * token count` times (keys + values, I assume?), given that the logits are only calculated for the latest token in the batch. LGTM.
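For anyone wanting to turn that observation into a progress readout, here is a minimal sketch in Python. It assumes the "2 calls per token" ratio above holds for your setup, which is only an observation from this thread and not a documented guarantee. `DecodeProgress`, `on_eval`, and `calls_per_token` are hypothetical names made up for illustration; they are not part of any library.

```python
from dataclasses import dataclass


@dataclass
class DecodeProgress:
    """Hypothetical helper: converts eval-callback invocations into a fraction."""

    n_tokens: int             # number of prompt tokens being decoded
    calls_per_token: int = 2  # assumption: the ratio observed in this thread
    calls: int = 0            # callback invocations seen so far

    def on_eval(self) -> bool:
        """Call once per callback invocation; always returns False, as described above."""
        self.calls += 1
        return False

    @property
    def fraction(self) -> float:
        """Estimated progress in [0, 1], clamped in case the ratio is higher than assumed."""
        expected = self.n_tokens * self.calls_per_token
        if expected <= 0:
            return 1.0
        return min(self.calls / expected, 1.0)


# Simulate half of the expected invocations for an 8-token prompt.
progress = DecodeProgress(n_tokens=8)
for _ in range(8):
    progress.on_eval()
print(f"{progress.fraction:.0%}")  # 8 of 16 expected calls
```

If the ratio differs in your configuration (the question above saw values between 2 and 3), `calls_per_token` would need to be measured once and passed in; the clamp only keeps the readout from exceeding 100%.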