Is decoding one token or two tokens at a time in llama.cpp? #13198
afsara-ben asked this question in Q&A · Unanswered
- In the decode stage the output for a single token is computed per step, so shouldn't the attention output projection be a matmul of shapes (1, 1, 4096) @ (1, 4096, 4096)? But when observing the layer-wise shapes, why does `src1` obtained from the last layer have shape [4096, 2, 1, 1] instead of [4096, 1, 1, 1]?
- What is the other token? Why does `kqv_out` have 2 tokens instead of 1 in the decoding stage? Is it batched inference? (A way to check is sketched below.)
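For context, ggml reports shapes as [ne0, ne1, ne2, ne3] with the embedding dimension first, so [4096, 2, 1, 1] means n_embd = 4096 and n_tokens = 2; for activations like `kqv_out`, ne[1] should match the number of tokens in the batch handed to `llama_decode`. One way to confirm what each decode call actually processes is to log per-tensor shapes through the eval-callback hook (`cb_eval` in `llama_context_params`, the same mechanism the imatrix tool uses). A minimal sketch, not the author's code; the loader/context function names follow a recent llama.cpp and may differ by version, and the model path is a placeholder:

```cpp
// Minimal sketch: log the shape of every tensor as the compute graph runs,
// via the cb_eval hook in llama_context_params. Names follow a recent
// llama.cpp; older versions spell the loader/context calls differently.
#include "llama.h"
#include "ggml.h"
#include <cstdio>

// ggml_backend_sched_eval_callback: invoked with ask == true to decide whether
// to observe a tensor, then again with ask == false once it has been computed.
static bool print_shapes_cb(struct ggml_tensor * t, bool ask, void * /*user_data*/) {
    if (ask) {
        return true; // observe every tensor
    }
    // ggml stores dims as ne[0..3]; for activations such as kqv_out,
    // ne[1] is the number of tokens in the current llama_decode call.
    fprintf(stderr, "%-24s [%5lld, %5lld, %5lld, %5lld]\n", t->name,
            (long long) t->ne[0], (long long) t->ne[1],
            (long long) t->ne[2], (long long) t->ne[3]);
    return true;
}

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.cb_eval           = print_shapes_cb;
    cparams.cb_eval_user_data = nullptr;

    llama_context * ctx = llama_init_from_model(model, cparams);
    // ... tokenize a prompt, then call llama_decode(ctx, batch) one token at a
    // time; every tensor shape in that pass is printed to stderr ...
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```

With this in place, a true single-token decode should log `kqv_out` with ne[1] == 1. If 2 shows up, the batch passed to that particular `llama_decode` call contained two tokens; one plausible source (depending on version and example used) is the warmup decode the common example code issues at startup, which feeds BOS and EOS together.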