(Paging @JohannesGaessler, who can maybe shed some light)
This was identified as being because of a cap. If you apply the fix: (screenshot)
Dating back to this commit: kalomaze@92497e1
With a build from that commit, made before these PR changes were finalized into the mainline branch of the kobold fork, I get 15 ms per token (~70 t/s prompt processing) instead of the current 25-30 ms per token (~40 t/s prompt processing). To me, this difference is pretty substantial: about 1.75x faster prompt eval times.
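For what it's worth, a quick sanity check of those throughput figures; this is just unit conversion of the numbers quoted above, and the 25 ms input is the low end of the quoted range:

```python
# Minimal sanity check of the numbers above; nothing here is measured,
# just unit conversion of the quoted figures.
def tokens_per_second(ms_per_token: float) -> float:
    # t/s is the reciprocal of seconds per token
    return 1000.0 / ms_per_token

fast = tokens_per_second(15.0)   # 66.7 t/s -> the "~70 t/s" figure
slow = tokens_per_second(25.0)   # 40.0 t/s -> the "~40 t/s" figure
print(f"{fast / slow:.2f}x")     # 1.67x; 70/40 gives the quoted 1.75x
```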
Unfortunately, since this is a fork, I don't have a clean way to pin down precisely where and how this regression happened. All I know is that the custom build I made to hack in faster prompt processing (before those two PRs were merged) is, to this day, the fastest build for Mixtral prompt processing compared to the latest llama.cpp or koboldcpp, and I'd like help understanding why, because there have been too many upstream improvements in other areas for me to keep using something like this.
Generation speeds are somewhat worse on this build (likely because upstream improvements since then have raised tg/s but not prompt eval speeds), but its prompt processing/batching is clearly superior by a large margin.
I noticed this quirk 2-3 weeks ago, so it doesn't appear to be a recent regression (nor was it caused by the multi-GPU changes); it dates back to when Mixtral was still new and getting the kinks worked out. I had hoped it was some odd temporary regression, but it has persisted.
Perhaps the ggml files could be diffed and compared to see if anything stands out that might be contributing to this?
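As a rough starting point, something like the sketch below could drive that comparison. It assumes the fork has already been fetched into a local checkout of the repo so the commit above resolves, and the two ggml files listed are only guesses at where the relevant changes might live:

```python
# Rough sketch: diff the ggml sources between the fast fork commit and the
# current checkout. Assumes the kalomaze fork has been fetched so 92497e1
# resolves locally; the file list is an assumption, not confirmed culprits.
import subprocess

FORK_COMMIT = "92497e1"                  # commit cited above
GGML_FILES = ["ggml.c", "ggml-cuda.cu"]  # assumed files of interest

for path in GGML_FILES:
    # plain `git diff <old> <new> -- <path>` on each file
    subprocess.run(["git", "diff", FORK_COMMIT, "HEAD", "--", path], check=True)
```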