Replies: 1 comment
-
Looks like Phind is using it to speed up inference: https://news.ycombinator.com/item?id=38089451
-
I just came across this and can't see the conversation elsewhere, so...
From the blog post:
"We present a technique, Flash-Decoding, that significantly speeds up attention during inference, bringing up to 8x faster generation for very long sequences. The main idea is to load the keys and values in parallel as fast as possible, then separately rescale and combine the results to maintain the right attention outputs."
Example forthcoming:
https://github.com/Dao-AILab/flash-attention/tree/main/examples/inference
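For intuition, here is a minimal sketch of the split-KV "rescale and combine" idea described in the quote. This is plain PyTorch, not the repo's actual kernel or API; the function name `split_kv_attention` and the single-query setup are assumptions for illustration. Each chunk of keys/values is attended to independently with its own log-sum-exp, and the partial outputs are then reweighted by their share of the global softmax mass and summed.

```python
# Illustrative sketch of the Flash-Decoding combine step, assuming a single
# decoding query vector. Not the flash-attention library's implementation.
import torch

def split_kv_attention(q, k, v, num_splits=4):
    """q: (d,), k/v: (seq_len, d). Matches full softmax attention output."""
    d = q.shape[-1]
    scale = d ** -0.5
    partial_outs, partial_lses = [], []
    # In the real kernel each split would run in parallel on its own blocks;
    # here they run sequentially just to show the math.
    for k_chunk, v_chunk in zip(k.chunk(num_splits), v.chunk(num_splits)):
        scores = (k_chunk @ q) * scale                 # (chunk_len,)
        lse = torch.logsumexp(scores, dim=-1)          # chunk log-sum-exp
        probs = torch.exp(scores - lse)                # softmax within the chunk
        partial_outs.append(probs @ v_chunk)           # (d,) partial output
        partial_lses.append(lse)
    # Combine: rescale each partial output by exp(lse_chunk - lse_global),
    # i.e. its fraction of the global softmax normalizer, then sum.
    lses = torch.stack(partial_lses)                   # (num_splits,)
    global_lse = torch.logsumexp(lses, dim=0)
    weights = torch.exp(lses - global_lse)             # (num_splits,)
    return (weights[:, None] * torch.stack(partial_outs)).sum(dim=0)

# Sanity check against the naive full-sequence computation.
q = torch.randn(64)
k, v = torch.randn(1024, 64), torch.randn(1024, 64)
ref = torch.softmax((k @ q) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(split_kv_attention(q, k, v), ref, atol=1e-5)
```

The rescaling works because each chunk's partial output is normalized by its own log-sum-exp; multiplying by exp(lse_chunk - lse_global) swaps that local normalizer for the global one, so the summed result equals the full softmax attention output.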