Question about recomputation #1740
Closed
puddingfjz announced in Q&A
Replies: 1 comment
-
I have a question about why recomputation can be implemented by concatenating the prompt tokens and the generated tokens into a new prompt and running a single prefill-stage computation.
For example, assume the prompt tokens are [t1, t2] and the generated tokens are [t3, t4]. I think the KV cache for t3 is computed from the hidden states of t3 in each layer, which are affected by t1 and t2 but not by t4.
However, if we use the new prompt [t1, t2, t3, t4] and run the prefill-stage computation, it seems the KV cache of t3 would be affected by t4.
I am confused here and hope someone can help. Thanks!
-
There is BlockDiagonalCausalMask.from_seqlens() as the attention bias in the prompt (prefill) stage. Because the mask is causal (and block-diagonal, so packed sequences cannot attend to each other), each token only attends to earlier tokens in its own sequence. During the prefill over [t1, t2, t3, t4], t3 therefore never attends to t4, and its KV cache comes out the same as in the original decoding.
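Below is a minimal sketch in plain PyTorch (toy random weights, a single attention head, no MLP or layer norm; it is not vLLM's or xformers' actual code) illustrating the masking argument: with a causal mask, the attention output for t3 is identical whether or not t4 is present in the prefill, so the K/V computed from t3's hidden states in every layer are identical as well.

```python
# Minimal sketch: causal self-attention over [t1, t2, t3, t4] vs. [t1, t2, t3].
# Toy random weights and a single head; this only illustrates the masking
# argument, it is not vLLM's (or xformers') actual implementation.
import torch

torch.manual_seed(0)
d = 8                                   # toy head dimension
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def causal_attention(x):
    """Single-head causal self-attention over a [seq_len, d] input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / d ** 0.5
    # Positions j > i get -inf, so token i cannot attend to later tokens.
    mask = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
    return torch.softmax(scores + mask, dim=-1) @ v

tokens = torch.randn(4, d)              # stand-ins for the embeddings of t1..t4

with_t4    = causal_attention(tokens)       # prefill over [t1, t2, t3, t4]
without_t4 = causal_attention(tokens[:3])   # same computation without t4

# The row for t3 (index 2) is identical in both runs: t4 never influences t3,
# so the K/V that the next layer computes from t3's hidden state are identical too.
print(torch.allclose(with_t4[2], without_t4[2]))  # True
```

In vLLM's prefill the same constraint is expressed through xformers' BlockDiagonalCausalMask.from_seqlens(), which applies this causal masking per sequence while also keeping the sequences packed into one batch from attending to one another.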