How do I combine vLLM with a FlashAttention-based Llama? #2784
alex1996-ljl
asked this question in
Q&A
Replies: 1 comment 1 reply
-
vLLM uses FlashAttention (plus many other inference optimizations). There is nothing you have to do to enable these; it should work out of the box.
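
  As a minimal sketch of what that looks like in practice, the snippet below loads a Llama model through vLLM's Python API; the model name and sampling settings are illustrative, and vLLM selects its attention backend on its own, so nothing attention-specific has to be configured:

  ```python
  from vllm import LLM, SamplingParams

  # vLLM picks an optimized attention backend (e.g. FlashAttention) automatically
  # when the hardware supports it; no extra flags are required.
  llm = LLM(model="meta-llama/Llama-2-7b-hf")  # illustrative model name

  sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

  # Batched generation; vLLM handles scheduling and paged KV-cache management internally.
  outputs = llm.generate(["Explain paged attention in one sentence."], sampling_params)

  for out in outputs:
      print(out.outputs[0].text)
  ```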
-
How can I combine vLLM with a Llama model that uses FlashAttention? My current application runs a FlashAttention-based Llama model, but I want to improve efficiency with vLLM. Is there a scheme that effectively combines the two?