I see in [...]. Where does the delta come into play for [...]? Also, what happens when [...]?
In the attention, the positions of the tokens are encoded via RoPE (i.e. rotations of the hidden state). Since RoPE is additive in the position (rotating by the old position and then by the delta is the same as rotating by the new position), we can "shift" cached keys by applying RoPE with the delta between the new and old positions. We don't apply it to the values (V) because they are not RoPEd.

This operation is not mathematically equivalent to recomputing the new context from scratch, but it is much faster and, somewhat surprisingly, seems to produce reasonable results.
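A minimal NumPy sketch of the idea (not the actual implementation; the pairing convention, function names, and toy shapes here are illustrative assumptions). It shows that rotating already-RoPEd keys by the position delta gives the same result as re-applying RoPE from scratch at the new positions, which is why only the cached K needs to be touched:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE to x of shape [n_tokens, head_dim] at the given positions.
    Each dimension pair (2i, 2i+1) is rotated by pos * theta_i."""
    n, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)        # per-pair frequencies, shape [d/2]
    ang = np.asarray(pos)[:, None] * theta[None, :]  # rotation angles, shape [n, d/2]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
n_tokens, head_dim = 4, 8
k_raw = rng.standard_normal((n_tokens, head_dim))  # pre-RoPE keys (hypothetical; not stored in practice)

old_pos = np.arange(n_tokens) + 10      # positions the cache was built with
delta = -6                              # shift, e.g. after discarding old context
new_pos = old_pos + delta

k_cached = rope_rotate(k_raw, old_pos)                        # what sits in the K cache (already RoPEd)
k_shifted = rope_rotate(k_cached, np.full(n_tokens, delta))   # rotate the cached keys by the delta only
k_recomputed = rope_rotate(k_raw, new_pos)                    # RoPE from scratch at the new positions

print(np.allclose(k_shifted, k_recomputed))  # True: rotation by old_pos then delta == rotation by new_pos
```

The equality only covers the K encoding itself; the attention outputs still differ from a full recompute because the hidden states feeding Q/K/V were produced under the old context, which is the caveat above.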