Replies: 1 comment
-
Looks like Phind is using it to speed up inference: https://news.ycombinator.com/item?id=38089451
-
I just came across this and can't see the conversation elsewhere, so...
From the blog post:
"We present a technique, Flash-Decoding, that significantly speeds up attention during inference, bringing up to 8x faster generation for very long sequences. The main idea is to load the keys and values in parallel as fast as possible, then separately rescale and combine the results to maintain the right attention outputs."
Example forthcoming:
https://github.com/Dao-AILab/flash-attention/tree/main/examples/inference
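For intuition, here is a minimal sketch of the split-KV "rescale and combine" idea described in the quote. This is plain PyTorch, not the repo's actual kernel or API; the function name `split_kv_attention` and the single-query setup are assumptions for illustration. Each chunk of keys/values is attended to independently with its own log-sum-exp, and the partial outputs are then reweighted by their share of the global softmax mass and summed.

```python
# Illustrative sketch of the Flash-Decoding combine step, assuming a single
# decoding query vector. Not the flash-attention library's implementation.
import torch

def split_kv_attention(q, k, v, num_splits=4):
    """q: (d,), k/v: (seq_len, d). Matches full softmax attention output."""
    d = q.shape[-1]
    scale = d ** -0.5
    partial_outs, partial_lses = [], []
    # In the real kernel each split would run in parallel on its own blocks;
    # here they run sequentially just to show the math.
    for k_chunk, v_chunk in zip(k.chunk(num_splits), v.chunk(num_splits)):
        scores = (k_chunk @ q) * scale                 # (chunk_len,)
        lse = torch.logsumexp(scores, dim=-1)          # chunk log-sum-exp
        probs = torch.exp(scores - lse)                # softmax within the chunk
        partial_outs.append(probs @ v_chunk)           # (d,) partial output
        partial_lses.append(lse)
    # Combine: rescale each partial output by exp(lse_chunk - lse_global),
    # i.e. its fraction of the global softmax normalizer, then sum.
    lses = torch.stack(partial_lses)                   # (num_splits,)
    global_lse = torch.logsumexp(lses, dim=0)
    weights = torch.exp(lses - global_lse)             # (num_splits,)
    return (weights[:, None] * torch.stack(partial_outs)).sum(dim=0)

# Sanity check against the naive full-sequence computation.
q = torch.randn(64)
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((k @ q) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(split_kv_attention(q, k, v), ref, atol=1e-5)
```

The rescaling works because each chunk's partial output is normalized by its own log-sum-exp; multiplying by exp(lse_chunk - lse_global) swaps that local normalizer for the global one, so the summed result equals the full softmax attention output.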