Enable faster prompt processing with mainline llama.cpp GGUFs #409
Conversation
Else they don't get run-time repacked.
Testing this PR (on top of the #405 and #408 PRs), here's a complete log when loading DeepSeek V3 0324 Q2_K_XL. Notably, I had to offload one layer fewer to CUDA 2 (compared to #405 (comment)), as CUDA 2 was now getting OOM. I noticed the compute buffers are ~3.3 GB each, instead of 2 GB and 400 MB respectively, despite using the -fa flag with -mla 3.
I noticed about a 15% improvement in PP t/s over the #405 PR, which means about 21% faster PP vs main llama.cpp (and something like a 400% improvement (no joke lol) without the #405 PR on ik llama.cpp).
Testing with -mla 2, the compute buffers are 3.4 GB as well vs -mla 3 with -fa. Here it got a small perf improvement (109 t/s PP vs 106 t/s PP). EDIT: I noticed that with this PR we have to specify -mla 1 to make the compute buffers smaller, as it doesn't automatically change it from 0 to 1.
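For context, a hedged sketch of the kind of invocation being compared in these tests; the model path, context size, and the exact multi-GPU split are placeholders, not the poster's actual command:

```bash
# Illustrative only: loading a DeepSeek GGUF with MLA and flash attention.
# -mla 3 -fa gives the fastest prompt processing but larger compute buffers;
# -mla 1 keeps the compute buffers small at some cost in PP speed.
./llama-server -m /models/DeepSeek-V3-0324-Q2_K_XL.gguf -fa -mla 3 -c 16384
./llama-server -m /models/DeepSeek-V3-0324-Q2_K_XL.gguf -mla 1 -c 16384
```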
The compute buffers become larger because one needs extra buffers for the transformed cache. If you are running out of VRAM, you can reduce the compute buffer size by using, e.g., a smaller u-batch size. The extra ~1 GiB in model size is for the newly created `attn_wkv_b` tensors.
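The exact flags from this reply aren't reproduced here. As a hedged example, assuming the compute buffer size scales with the u-batch size (as it does in llama.cpp-family code), something along these lines trades prompt-processing speed for VRAM:

```bash
# Illustrative: a smaller batch / u-batch size shrinks the per-GPU compute buffers.
./llama-server -m /models/DeepSeek-V3-0324-Q2_K_XL.gguf -fa -mla 3 -b 1024 -ub 256
```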
Mainline llama.cpp PR 12901, which added MLA support for DeepSeek models 2.5 months after MLA was available here, broke backwards compatibility. As a result, the new DeepSeek GGUFs that started appearing on HF became incompatible with ik_llama.cpp, so I added support for the incompatible GGUFs in #394. But using such a crippled DeepSeek GGUF results in much lower prompt processing performance. This is because the `attn_wkv_b` tensor is missing, so one cannot use `mla = 3`.
This PR removes this limitation. When `-mla 0, 2 or 3` is specified on the command line, missing `attn_wkv_b` tensors are created on-the-fly while loading the model. This is basically the reverse of #259, where the `attn_wk_b` and `attn_wv_b` tensors necessary for MLA were computed from the `attn_wkv_b` tensors in the original DeepSeek GGUFs.
To show why this is useful, the following graph compares PP performance between the main branch and this PR. The `sweep-bench` command is shown below.
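The original command block was not preserved in this copy, so here is a hypothetical `llama-sweep-bench` invocation of roughly that shape (model path, context size, and layer count are placeholders):

```bash
# Hypothetical reconstruction, not the exact command from the PR:
# sweeps PP/TG speed as the KV cache fills, with MLA mode 3 and flash attention.
./llama-sweep-bench -m /models/deepseek-lite-q4_0.gguf -mla 3 -fa -c 16384 -ngl 100
```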
The model is a mainline `llama.cpp` DeepSeek-Lite GGUF with the `attn_wkv_b` tensors missing. In that case the `mla = 3` parameter will be converted to `mla = 1` on the main branch, but it triggers the generation of the `attn_wkv_b` tensors in this PR (so `mla = 3` can be used). The model is quantized with `Q4_0`; the GPU is an RTX-4080. The x-axis is `N_KV/1000`, where `N_KV` is the number of tokens in the KV cache. I have used a logarithmic scale for the y-axis to better show the growing difference in performance with increasing `N_KV`.