Question about recomputation #1740
Closed
puddingfjz announced in Q&A
Replies: 1 comment
-
I have a question about why recomputation can be implemented by concatenating the prompt tokens and the generated tokens into a new prompt and running a single prefill-stage computation.
For example, assume the prompt tokens are [t1, t2] and the generated tokens are [t3, t4]. I think the KV cache for t3 is computed from the hidden states of t3 in each layer, which are affected by t1 and t2 but not by t4.
However, if we use the new prompt [t1, t2, t3, t4] and run the prefill-stage computation, it seems the KV cache of t3 would be affected by t4.
I am confused here and hope someone can help. Thanks!
-
There is BlockDiagonalCausalMask.from_seqlens() as the attention bias in the prompt (prefill) stage. Because the mask is causal (and block-diagonal, so packed sequences cannot attend to each other), each token only attends to earlier tokens in its own sequence. During the prefill over [t1, t2, t3, t4], t3 therefore never attends to t4, and its KV cache comes out the same as in the original decoding.
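Below is a minimal sketch in plain PyTorch (toy random weights, a single attention head, no MLP or layer norm; it is not vLLM's or xformers' actual code) illustrating the masking argument: with a causal mask, the attention output for t3 is identical whether or not t4 is present in the prefill, so the K/V computed from t3's hidden states in every layer are identical as well.

```python
# Minimal sketch: causal self-attention over [t1, t2, t3, t4] vs. [t1, t2, t3].
# Toy random weights and a single head; this only illustrates the masking
# argument, it is not vLLM's (or xformers') actual implementation.
import torch

torch.manual_seed(0)
d = 8                                   # toy head dimension
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def causal_attention(x):
    """Single-head causal self-attention over a [seq_len, d] input."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / d ** 0.5
    # Positions j > i get -inf, so token i cannot attend to later tokens.
    mask = torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)
    return torch.softmax(scores + mask, dim=-1) @ v

tokens = torch.randn(4, d)              # stand-ins for the embeddings of t1..t4

with_t4    = causal_attention(tokens)       # prefill over [t1, t2, t3, t4]
without_t4 = causal_attention(tokens[:3])   # same computation without t4

# The row for t3 (index 2) is identical in both runs: t4 never influences t3,
# so the K/V that the next layer computes from t3's hidden state are identical too.
print(torch.allclose(with_t4[2], without_t4[2]))  # True
```

In vLLM's prefill the same constraint is expressed through xformers' BlockDiagonalCausalMask.from_seqlens(), which applies this causal masking per sequence while also keeping the sequences packed into one batch from attending to one another.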