(Paging @JohannesGaessler, who can maybe shed some light)
This was identified as being because of a cap. If you apply the fix: (screenshot)
Dating back to this commit: kalomaze@92497e1
With a build from that commit, made before these PR changes were finalized into the mainline branch of the kobold fork, I get 15 ms per token (~70 t/s prompt processing) instead of the current 25-30 ms per token (~40 t/s prompt processing). To me, this difference is pretty substantial: about 1.75x faster prompt eval times.
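For what it's worth, a quick sanity check of those throughput figures; this is just unit conversion of the numbers quoted above, and the 25 ms input is the low end of the quoted range:

```python
# Minimal sanity check of the numbers above; nothing here is measured,
# just unit conversion of the quoted figures.
def tokens_per_second(ms_per_token: float) -> float:
    # t/s is the reciprocal of seconds per token
    return 1000.0 / ms_per_token

fast = tokens_per_second(15.0)   # 66.7 t/s -> the "~70 t/s" figure
slow = tokens_per_second(25.0)   # 40.0 t/s -> the "~40 t/s" figure
print(f"{fast / slow:.2f}x")     # 1.67x; 70/40 gives the quoted 1.75x
```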
Unfortunately, since this is a fork, I don't have a clean way to pin down precisely where and how this regression happened. All I know is that the custom build I made to hack in faster prompt processing (before those two PRs were merged) is, to this day, the fastest build for Mixtral prompt processing compared to the latest llama.cpp or koboldcpp, and I'd like help understanding why, because there have been too many upstream improvements in other areas for me to keep using something like this.
Generation speeds are somewhat worse on this build (likely because upstream improvements since then have raised tg/s but not prompt eval speeds), but its prompt processing/batching is clearly superior by a large margin.
I noticed this quirk 2-3 weeks ago, so it doesn't appear to be a recent regression (nor was it caused by the multi-GPU changes); it dates back to when Mixtral was still new and getting the kinks worked out. I had hoped it was some odd temporary regression, but it has persisted.
Perhaps the ggml files could be diffed and compared to see if anything stands out that might be contributing to this?
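As a rough starting point, something like the sketch below could drive that comparison. It assumes the fork has already been fetched into a local checkout of the repo so the commit above resolves, and the two ggml files listed are only guesses at where the relevant changes might live:

```python
# Rough sketch: diff the ggml sources between the fast fork commit and the
# current checkout. Assumes the kalomaze fork has been fetched so 92497e1
# resolves locally; the file list is an assumption, not confirmed culprits.
import subprocess

FORK_COMMIT = "92497e1"                  # commit cited above
GGML_FILES = ["ggml.c", "ggml-cuda.cu"]  # assumed files of interest

for path in GGML_FILES:
    # plain `git diff <old> <new> -- <path>` on each file
    subprocess.run(["git", "diff", FORK_COMMIT, "HEAD", "--", path], check=True)
```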