I'm not sure I really know what I'm talking about, so please disregard this if it makes no sense. Or maybe it already works this way...

It feels like computing the matrix K*t(Q) might take a lot of memory: longer context sizes overflow GPU memory fast. But to compute the i-th row of the attention output (K*t(Q))*V, it is only necessary to know the single i-th row of K*t(Q).

Might it be worth computing the K*t(Q) matrix row by row, multiplying each row by V, and storing only the results? Only a single row is needed at a time, and each row is needed only once: the i-th row is used once to compute the i-th row of the result. This way, longer contexts should increase memory requirements linearly rather than quadratically.
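To make the proposal concrete, here is a minimal NumPy sketch of the row-by-row idea (illustration only, not llama.cpp code; it uses the more common softmax(Q·Kᵀ)·V orientation, but the argument is the same for K*t(Q)). The key point is that softmax is applied per row, so each score row can be consumed and then discarded:

```python
import numpy as np

def attention_naive(Q, K, V):
    """Materializes the full n x n score matrix: O(n^2) memory."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, n) scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)            # row-wise softmax
    return P @ V                                  # (n, d) output

def attention_row_by_row(Q, K, V):
    """One score row at a time: O(n) extra memory, same result."""
    n, d = Q.shape
    out = np.empty((n, d), dtype=Q.dtype)
    for i in range(n):
        s = (Q[i] @ K.T) / np.sqrt(d)             # i-th score row, shape (n,)
        p = np.exp(s - s.max())
        p /= p.sum()                              # softmax over that one row
        out[i] = p @ V                            # i-th row of the output
        # s and p go out of scope here; the n x n matrix never exists
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(attention_naive(Q, K, V), attention_row_by_row(Q, K, V))
```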
Replies: 1 comment

This is already done when Flash Attention is enabled: the KQ matrix is not materialized and the attention is computed in a memory-efficient way.
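For reference, here is a rough sketch of the kind of online-softmax accumulation Flash Attention uses (again illustrative Python, not the actual ggml/llama.cpp implementation; `block` and the variable names are made up). Keys and values are processed in tiles, and running maxima and denominators keep the softmax exact, so the full n x n score matrix is never formed:

```python
import numpy as np

def attention_online(Q, K, V, block=64):
    """Flash-Attention-style accumulation: only an (n, block) score tile
    exists at any moment; partial results are rescaled as new tiles arrive."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)                       # running row maxima
    l = np.zeros(n)                               # running softmax denominators
    acc = np.zeros((n, d))                        # running weighted sums of V
    for j in range(0, K.shape[0], block):
        s = Q @ K[j:j + block].T * scale          # (n, block) score tile
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)            # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        acc = acc * correction[:, None] + p @ V[j:j + block]
        m = m_new
    return acc / l[:, None]                       # == softmax(Q K^T / sqrt(d)) @ V
```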