Replies: 1 comment
-
Yes, I saw this PR. But to quote Diego's statement in the PR discussion:

> I fully agree with that. The back-end is really fragile, so performance gains must be way more than 2-3% to warrant a change such as that one.
-
Title : Overlap CUDA graph building and processing to minimize GPU idle time and improve tokens per second performance (#11867)
Link : ggml-org/llama.cpp#11867
Author : @aendk
Use : a few % boost on CUDA prompt processing (PP) and text generation (TG)? See the sketch below for the general idea.
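
As a rough illustration of what the PR title describes (not the PR's actual implementation), the sketch below overlaps host-side CUDA graph capture and instantiation with asynchronous execution of the previously built graph, so the GPU keeps processing graph N while the CPU prepares graph N+1. The kernel, buffer sizes, and the double-buffered exec slots are assumptions made for the example; it assumes CUDA 11.4+ for `cudaGraphInstantiateWithFlags`.

```cpp
// Hedged sketch: overlap host-side CUDA graph construction with device-side
// execution of the previously built graph. Illustrative only, not llama.cpp code.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void step_kernel(float* data, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= scale;
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t exec_stream, capture_stream;
    cudaStreamCreate(&exec_stream);
    cudaStreamCreate(&capture_stream);

    cudaGraphExec_t graph_exec[2] = {nullptr, nullptr};
    cudaEvent_t     done[2];
    cudaEventCreate(&done[0]);
    cudaEventCreate(&done[1]);

    for (int iter = 0; iter < 8; ++iter) {
        const int slot = iter % 2;

        // Before reusing a slot, make sure its previous launch has finished.
        if (graph_exec[slot]) {
            cudaEventSynchronize(done[slot]);
            cudaGraphExecDestroy(graph_exec[slot]);
        }

        // Build this iteration's graph on the host. Stream capture only
        // records the work; nothing runs on the GPU here, so this CPU-side
        // construction overlaps with the previous graph still executing
        // asynchronously on exec_stream.
        cudaGraph_t graph;
        cudaStreamBeginCapture(capture_stream, cudaStreamCaptureModeGlobal);
        step_kernel<<<(n + 255) / 256, 256, 0, capture_stream>>>(d_data, n, 1.0f + iter);
        cudaStreamEndCapture(capture_stream, &graph);
        cudaGraphInstantiateWithFlags(&graph_exec[slot], graph, 0);
        cudaGraphDestroy(graph);

        // Launch asynchronously: the host returns immediately and starts
        // building the next graph while the GPU processes this one.
        cudaGraphLaunch(graph_exec[slot], exec_stream);
        cudaEventRecord(done[slot], exec_stream);
    }

    cudaStreamSynchronize(exec_stream);
    printf("done\n");

    cudaEventDestroy(done[0]);
    cudaEventDestroy(done[1]);
    cudaGraphExecDestroy(graph_exec[0]);
    cudaGraphExecDestroy(graph_exec[1]);
    cudaStreamDestroy(exec_stream);
    cudaStreamDestroy(capture_stream);
    cudaFree(d_data);
    return 0;
}
```

The actual PR works inside llama.cpp's CUDA backend and derives its graphs from the ggml compute graph, so the details differ; this only illustrates the scheduling idea of keeping the GPU busy while the host builds the next graph.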