I'm not sure I really know what I'm talking about, so please disregard this if it makes no sense. Or maybe it already works this way...

It feels like computing the matrix K*t(Q) might take a lot of memory: longer context sizes overflow GPU memory fast. But to compute the i-th row of the attention output (K*t(Q))*V, it is only necessary to know the single i-th row of K*t(Q).

Might it be worth computing the K*t(Q) matrix row by row, multiplying each row by V, and storing only the results? Only a single row is needed at a time, and each row is needed only once: the i-th row is used once to compute the i-th row of the result. This way, longer contexts should increase memory requirements linearly rather than quadratically.
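To make the proposal concrete, here is a minimal NumPy sketch of the row-by-row idea (illustration only, not llama.cpp code; it uses the more common softmax(Q·Kᵀ)·V orientation, but the argument is the same for K*t(Q)). The key point is that softmax is applied per row, so each score row can be consumed and then discarded:

```python
import numpy as np

def attention_naive(Q, K, V):
    """Materializes the full n x n score matrix: O(n^2) memory."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, n) scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)            # row-wise softmax
    return P @ V                                  # (n, d) output

def attention_row_by_row(Q, K, V):
    """One score row at a time: O(n) extra memory, same result."""
    n, d = Q.shape
    out = np.empty((n, d), dtype=Q.dtype)
    for i in range(n):
        s = (Q[i] @ K.T) / np.sqrt(d)             # i-th score row, shape (n,)
        p = np.exp(s - s.max())
        p /= p.sum()                              # softmax over that one row
        out[i] = p @ V                            # i-th row of the output
        # s and p go out of scope here; the n x n matrix never exists
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(attention_naive(Q, K, V), attention_row_by_row(Q, K, V))
```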
Replies: 1 comment

This is already done when Flash Attention is enabled: the KQ matrix is not materialized and the attention is computed in a memory-efficient way.
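For reference, here is a rough sketch of the kind of online-softmax accumulation Flash Attention uses (again illustrative Python, not the actual ggml/llama.cpp implementation; `block` and the variable names are made up). Keys and values are processed in tiles, and running maxima and denominators keep the softmax exact, so the full n x n score matrix is never formed:

```python
import numpy as np

def attention_online(Q, K, V, block=64):
    """Flash-Attention-style accumulation: only an (n, block) score tile
    exists at any moment; partial results are rescaled as new tiles arrive."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)                       # running row maxima
    l = np.zeros(n)                               # running softmax denominators
    acc = np.zeros((n, d))                        # running weighted sums of V
    for j in range(0, K.shape[0], block):
        s = Q @ K[j:j + block].T * scale          # (n, block) score tile
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)            # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        acc = acc * correction[:, None] + p @ V[j:j + block]
        m = m_new
    return acc / l[:, None]                       # == softmax(Q K^T / sqrt(d)) @ V
```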