Use cuBLAS for large batches and quants with block size 16 #559

Conversation
I tried this "build = 3773 (3dbc843)" on ubergarm's DeepSeek-R1-0528-GGUF IQ2_K_R4 with -b 4096 -ub 4096. Both are about the same:

Did I do something wrong? My rig is Epyc Genoa + 6000 Ada. Built with |
EDIT: wait, your old test was on 1843ed2 which was before PR557 was merged?? huh, i would imagine you would see some speed boost. compare against the commands i'm using below to see if something else is up?

Yeah, the speed boosts specific to the IQ2_K_R4 and IQ3_K_R4 quantizations (in the quant you mention) were already added in PR557. This PR is doing a similar thing for some other quant types like Q2_K etc.

I just did another test for PR557 using this git sha, which is a bit confusing as I'm not actually testing all the new quants added here. But you can see the speed-up is pretty good relative to just before PR557 was merged, as shown below:

👈 compile, llama-sweep-bench, data

cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)
model=DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
--no-mmap \
--ctx-size 12288 \
-ctk q8_0 \
-mla 3 -fa \
-fmoe \
-amb 512 \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9|10|11|12)\.ffn_.*=CUDA0" \
-ot "blk\.(13|14|15|16|17|18|19|20|21|22)\.ffn_.*=CUDA1" \
-ot exps=CPU \
--warmup-batch \
--threads 24
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q5_0: 61 tensors
llama_model_loader: - type iq4_ks: 116 tensors
llama_model_loader: - type iq5_ks: 435 tensors
llama_model_loader: - type iq2_k_r4: 116 tensors
llama_model_loader: - type iq3_k_r4: 58 tensors

PR559@3dbc8437 -ub 512 -b 2048
PR559@3dbc8437 -ub 2048 -b 2048
PR559@3dbc8437 -ub 4096 -b 4096
main@8e5106b2 -ub 512 -b 2048
main@8e5106b2 -ub 2048 -b 2048
main@8e5106b2 -ub 4096 -b 4096
|
Noob question, and sorry to ask here, but does this PR apply to sub k-quants, like q2_k_s, q3_k_m, q4_k_l, q5_k_xl, etc.? |
I thought about it some more, and both this PR559 and PR557 only apply when the mentioned quantized tensors are running on CUDA. So for my quant that you mention, the

So to see the speed boost you have to offload more of those specific layers onto CUDA e.g.

This kinda ties into @Panchovix's great question, and I'd love to do a video called "What is in a quant?" to explain better, because it is pretty confusing until you dig into it with either

You see it has the filename
So things like

Personally I don't follow the conventions established in llama-quantize and pretty much always override everything with whatever I want to use. So when you start my
So there is a lot more going on under the hood than the name suggests. My personal convention is to name the quant "recipe" filename after whatever the main

To keep it relevant to this PR, you need to look inside your gguf and see if any of the mentioned quantization types apply to tensors which you are running on CUDA (see the sketch just below). Cheers! |
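As one way to do that "look inside your gguf" step, here is a rough, untested C++ sketch (assuming the gguf C API declared in ggml.h, e.g. gguf_init_from_file and gguf_get_tensor_type) that lists every tensor with its quantization type, so you can cross-check the names against your -ot/-ngl offload settings:

// rough sketch, not part of this PR: dump each tensor's name and quantization
// type from a GGUF file using the gguf C API declared in ggml.h
#include "ggml.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    // no_alloc = true: read only metadata, do not allocate tensor data
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { fprintf(stderr, "failed to open %s\n", argv[1]); return 1; }
    const int n_tensors = gguf_get_n_tensors(ctx);
    for (int i = 0; i < n_tensors; ++i) {
        // prints e.g. "blk.3.ffn_down_exps.weight   iq2_k_r4"
        printf("%-48s %s\n", gguf_get_tensor_name(ctx, i),
               ggml_type_name(gguf_get_tensor_type(ctx, i)));
    }
    gguf_free(ctx);
    return 0;
}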
I know this is confusing. Users specify the quantization with a llama type ( |
Performance impact is easier to test with a dense model. For a MoE model such as DeepSeek-R1/V3, even at a batch size of 4096 tokens the experts process on average just 128 tokens, so we are still far away from the point where the transition to dequantize+cuBLAS occurs. Most of the self-attention computation is within the FA implementation, which does not use the regular matrix multiplications, so there are just a few affected matrix multiplications left, and they usually take a small fraction of the overall calculation, so the impact is negligible (and, as pointed out by @ubergarm, the test done by @ewhacc is not affected by this PR).

But if you are running a dense model with partial offload, you will want to use larger batches/u-batches to minimize the time spent on copying tensors from RAM to VRAM relative to the time spent on actual calculations. In that case you ought to see a measurable impact on PP performance, provided the model contains quantization types affected by this PR. |
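For reference, the 128-token figure follows from DeepSeek-R1/V3's routing (256 routed experts, 8 active per token), assuming the router spreads tokens roughly evenly across experts:

tokens per expert ≈ n_ubatch × n_active_experts / n_routed_experts = 4096 × 8 / 256 = 128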
Here is an example illustrating my previous post. Running Llama-3.1-70B quantized with
I have uploaded only 30 out of 80 layers to the GPU so I can run with the larger u-batch. If instead I use the default u-batch of 512, I can upload 50 layers to the GPU. With that I get |
Okay, I made a few Qwen3-14B dense "pure" quants (q4_K token_embd, q6_K output "head") and am seeing roughly a 1.4x speedup on PP with this PR over main for

This is great and really changes things given

👈 sweep-bench command and results

CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
--model "$model" \
--ctx-size 20480 \
-ctk f16 -ctv f16 \
-fa \
-ngl 99 \
-ub 4096 -b 4096 \
--no-mmap \
--warmup-batch \
--threads 1

IQ4_K PR559@3dbc8437 -ub 4096 -b 4096
IQ4_KS PR559@3dbc8437 -ub 4096 -b 4096
IQ5_K PR559@3dbc8437 -ub 4096 -b 4096
IQ5_KS PR559@3dbc8437 -ub 4096 -b 4096
IQ4_K main@5236c98b -ub 4096 -b 4096
IQ5_K main@5236c98b -ub 4096 -b 4096
I didn't check the remaining quantization types with a block size of 16. |
Before you throw these quants away, try |
Sure thing. Also it is interesting that q6_K now has slightly faster PP than q4_K at 4096 ub/b |
I guess you have a higher-end GPU. On my RTX-4080, for a fully offloaded dense model the peak is somewhere between

But the |
Here is the IQ2_XS on both forks with data

IQ2_XS ik_llama.cpp@3dbc8437 -ub 2048 -b 2048
IQ2_XS ik_llama.cpp@3dbc8437 -ub 512 -b 2048
IQ2_XS llama.cpp@6c510f3b -ub 2048 -b 2048
IQ2_XS llama.cpp@6c510f3b -ub 512 -b 2048
Interestingly, mainline llama.cpp is slightly slower when increasing the ubatch size over the default. |
Oops, sorry, I misread your |
So, the A6000 has more memory bandwidth than the 4080. This shifts things in favor of dequantize+cuBLAS because the dequantize step is memory bound, so it is quicker on the A6000. I guess this is why with |
Based on @ubergarm's and my own testing this PR looks like a winner, so merging. |
Hi, this is u/smflx on reddit. Thanks a lot for the detailed reply. :)

Yes, the old test was on 1843ed2, which is a little before PR557. This, PR557, and PR559 all give the same PP speed of 272 t/s. Yes, it was boosted recently. If the boost specific to IQ2_K_R4 was already added, that's understandable. In your graph showing the boost on IQ2_K_R4, main is before PR557, right?

My PP speed of 272 t/s is similar to your S_PP 276.35 t/s, so it seems OK. I will check llama-sweep-bench later. Thanks a lot for the guide. My setup is about the same except: -DGGML_CUDA_F16=ON, -ctk q16_0
I'm using a 6000 Ada, but I think the speed will be about the same as an A6000. The GPUs are not fully utilized; I guess PCIe speed is the bottleneck. |
I have tested with the same llama-sweep-bench setup you provided on my rig. I just changed the thread count to '--threads 32', which is optimal for 9534.
|
These performance results look pretty good to me. Has anyone ever reported a better result for hybrid GPU/CPU DeepSeek-R1/V3 inference? |
Haven't managed to test much as I accidentally wiped my Fedora installation from Windows lol. But I was testing with llama-sweep-bench and got one error; I can't remember exactly what the error was, or whether it is related to this PR. I have just saved how I run the model, which is
I managed to see 200 t/s PP and 8.73 t/s TG, but then got an error. I will try to update when I get Linux installed again, as offloading + multi-GPU is just not worth it on Windows; speeds are way worse. |
Okay, finally installed Fedora yesterday. Testing remotely now, so it is a bit slower (I'm using software encoding and it uses 2-3 threads).

With the same command as above. Sometimes it also crashes with another CUDA error, but I still have to catch it again. Again, not sure what it is related to. |
Okay finally got the other error.
Sorry for the spam; gonna raise an issue, but I still don't know how to reliably replicate it. |
While working on #557 I noticed that dequantize+cuBLAS is faster than MMQ for the iqX_k_r4 quants when the batch size is larger than some threshold. The same applies to all quantization types with a block size of 16: Q2_K, Q3_K, Q6_K, IQ2_XS, IQ2_S, IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K. Hence, this PR changes the ggml_cuda_should_use_mmq() function to return false if the batch size (number of rows in the right matrix) is greater than a quantization-type-specific threshold.

This graph illustrates the PP performance improvement achieved this way for k-quants. Model is Llama-3.1-8B-Instruct, GPU is RTX-4080, and in all cases pure quantization is used.
Q2_K appears to have a particularly bad MMQ implementation (I need to look into that more closely), so there we benefit from switching to dequantize+cuBLAS already at 384 tokens, and achieve a solid 30-35% improvement for batch sizes above 1000 tokens. The MMQ implementation for the other quants (also those not shown) is better, so performance gains are in the range of 10% at a batch size of 4k tokens. For quants with a block size of 32 (all others not listed above) MMQ is always better than dequantize+cuBLAS up to a batch size of 4096 tokens, so they are left unchanged by the PR.
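To make the mechanism concrete, here is a minimal C++ sketch of the kind of check described above. It is illustrative only and is not the actual ggml_cuda_should_use_mmq() code: apart from the 384-token figure for Q2_K quoted above, the type grouping and the threshold value are made up for the example.

// illustrative sketch only -- not the actual ggml_cuda_should_use_mmq() implementation.
// Idea: for quant types with block size 16, stop using MMQ (i.e. fall back to
// dequantize + cuBLAS) once the batch size exceeds a per-type threshold.
#include "ggml.h"
#include <cstdint>

static bool should_use_mmq_sketch(enum ggml_type type, int64_t ne11 /* batch size = rows of right matrix */) {
    int64_t cublas_threshold;
    switch (type) {
        case GGML_TYPE_Q2_K:   cublas_threshold = 384;  break; // weak MMQ kernel -> switch early (figure from the PR text)
        case GGML_TYPE_Q3_K:
        case GGML_TYPE_Q6_K:
        case GGML_TYPE_IQ2_XS:
        case GGML_TYPE_IQ2_S:  cublas_threshold = 1024; break; // hypothetical threshold, for illustration only
        default:               return true;                    // block size 32: MMQ stays faster up to 4096 tokens
    }
    return ne11 < cublas_threshold;
}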