Pathological Quant/CUDA combinations -- How to know what works? #613
- Instead, you absolutely do not want to split up […] instead. If you split […]
- One more thing: if you have enough VRAM to use a batch and u-batch of 4096, you should try removing […]
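For context, batch and u-batch sizes like those are set with the `-b`/`-ub` flags. A minimal sketch, assuming a llama.cpp/ik_llama.cpp-style `llama-server` invocation with a placeholder model path:

```sh
# Hypothetical invocation; only the batch-size flags are the point here.
# -b (--batch-size) sets the logical batch, -ub (--ubatch-size) the physical micro-batch.
./llama-server -m /path/to/model.gguf -b 4096 -ub 4096
```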
Some quants/tensors seem to be incompatible with CUDA. My current example is a Q6_K (unsloth) quant of Kimi K2. If I leave all routed experts on the CPU, I get TG of ~9 tps. There is some VRAM remaining (RTX 8000, Turing, 48 GB), so I can put a few tensors (e.g. up_exps) on the GPU. When I do this, TG drops to 1 tps or worse.
I've seen this phenomenon before when trying to offload routed experts with some other quant types (with DeepSeek R1/V3). My understanding (I think @ubergarm explained it somewhere) is that some quants are not supported on CUDA and therefore must be converted before use, per token.
PP throughput (~80 tps) is not noticeably affected, presumably because of batching (b = ub = 4096).
With all routed experts kept on the CPU I get the good outcome, ~9 tps TG. If I change the overrides to put a few ffn_up_exps tensors on the GPU, TG drops to 1 tps or worse (roughly as sketched below).
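The two configurations would look roughly like this with `-ot`/`--override-tensor`. The invocation is a hypothetical sketch: the model filename, layer indices, and regexes are placeholders, not the exact arguments from these runs.

```sh
# Good outcome (~9 tps TG): keep every routed-expert tensor on the CPU.
./llama-server -m Kimi-K2-Q6_K.gguf -ngl 99 -b 4096 -ub 4096 \
  -ot "exps=CPU"

# Bad outcome (~1 tps TG): move a few ffn_up_exps layers to the GPU
# (the more specific override is listed before the catch-all exps=CPU).
./llama-server -m Kimi-K2-Q6_K.gguf -ngl 99 -b 4096 -ub 4096 \
  -ot "blk\.(1|2|3)\.ffn_up_exps=CUDA0" \
  -ot "exps=CPU"
```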
Assuming the idea is correct, Q6_K is a pathological quant type (at least on Turing) -- how could I have known this? How can I know what my options are when building GGUFs that match my offload/CPU arrangement?
edit: I shouldn't say they are not supported; rather, they aren't integrated into a kernel for the required op.
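One rough way to check is to grep the CUDA sources for the quant type and see which kernels mention it. This is a sketch assuming mainline llama.cpp's source layout (a fork like ik_llama.cpp may differ), and it is only a heuristic, not an authoritative support matrix:

```sh
# Rough heuristic: see which CUDA sources mention the quant type at all.
# Paths assume mainline llama.cpp's layout; adjust for the fork you actually build.
grep -rln "GGML_TYPE_Q6_K" ggml/src/ggml-cuda/

# The quantized mat-vec kernels (the path token generation takes for
# offloaded tensors) live around mmvq.cu in mainline.
grep -n "Q6_K" ggml/src/ggml-cuda/mmvq.cu
```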