-
I trawled through some of the PRs you linked to me and pulled together this rough guide as my notes for getting started with ik_llama.cpp. The biggest hurdle so far is needing a custom quant for MLA support; I'll work on that another time as I'm using the og unsloth quants. My initial impression is that with the right settings it can get faster prompt processing than ktransformers and about the same token generation. Looking forward to trying it with an MLA-supported quant.
-
Thank you for these results.
#259 should remove this hurdle. With this PR, models prepared with mainline llama.cpp can be used with MLA enabled.
-
Just thought you'd want to know this; manually notifying you as edits don't trigger notifications.
-
None that still do that haven't been mentioned in the conversation already; there was an issue with IQ1_S_R4, but that was fixed here: #194
Everything looks reasonable to me (especially since you were thorough and tried a bunch of valid combinations, and any valid combination shouldn't NaN on perplexity, but since all of them do that might help narrow down where the problem lies).
Nice.
-
I don't think it's the size that is the issue; iq2_bn_r4 is a bitnet quant. I briefly tested an IQ1_S_R4, which didn't even have the benefit of going to q8_0 for the non-expert tensors like you did, and I still got FAR more reasonable perplexity numbers (exact values here, with the quant log here). If you are still experimenting with quant types, you might be able to improve on your Q2_K_R4 at around the same size by replacing the q2_k_r4 and q3_k_r4 (which are k-quants) with similar-sized i-quants or iqk-quants. This PR #85 has a really nice chart focusing on that quant range (caveat: IQ3_KL is not a quant type, it is a quant recipe), and shows how the three different quant types (i, k, and iqk) stack up.
-
They are actually great. But they are Bitnet quants, i.e. quants for a model that has been trained such that the model weights take one of 3 possible values (-1, 0, 1). Hence, they absolutely cannot be used for normal models trained using actual floats. But that does not make them not great. The ternary quants in this repo (e.g., IQ2_BN) are great when used for actual ternary models.
-
@ubergarm huge thanks for this guide! Any chance you could publish the DeepSeek-R1_Q2_K_R4 quant described here?

First of all, thanks for doing all the research on running DeepSeek-R1 locally and publishing high quality technical details. Your posts on level1techs and reddit are currently the only good sources of information available on the subject. My internet searches related to purchasing decisions for running DSR1 always end up on one of your posts!

I started with a 7975wx system for CPU-only inference, and overclocked the memory controller based on your benchmarking on level1techs. Then, based on this guide, I ended up shelling out for an RTX 5090. Switching from CPU-only inference with ollama to CPU+GPU inference with ik_llama resulted in a 5x inference speedup. The speed improvements are more pronounced for longer contexts; I am able to get roughly 10 tps inference on a 40k context with the unsloth/DeepSeek-R1-UD-Q2_K_XL quant. Since the 5090 has more memory, I offloaded all the small layers onto the GPU with the --override-tensor option.
Would love to get my hands on the DeepSeek-R1_Q2_K_R4 quant!
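For illustration, here is a minimal sketch of the kind of -ot layout being described: routed experts stay on the CPU, while the experts of a few layers are pinned to the GPU to use the extra VRAM of a 32GB card. The model path, layer indices, thread count, and context size are placeholders, and it assumes earlier -ot rules take precedence over later ones; flag choices follow the settings discussed elsewhere in this thread.

```bash
# Sketch only: pin the routed experts of layers 3-5 (arbitrary choice) to the
# GPU and keep the remaining experts on CPU. Paths and counts are placeholders.
./build/bin/llama-server \
    --model /models/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --ctx-size 40960 \
    -mla 3 -fa -fmoe -amb 512 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5)\.ffn_.*_exps=CUDA0" \
    -ot exps=CPU \
    --threads 24 \
    --host 127.0.0.1 --port 8080
```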
-
Heya @anikiforovopensource, I appreciate the feedback; it's been great working with tools provided by the great developers to push the envelope! Glad you have found some of this useful.

I updated the guide with a link to the hugging face repo that contains a couple of quants: https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF Sorry it is difficult to piece together all the bread crumbs across so many sites, but it sounds like you are having good success.

The 5090 with 32GB VRAM is actually a pretty great size for the quants I made. Use the CPU+GPU example on the model card; that is what I would recommend.

I'd love to see any benchmark results; you can see how to run the benchmarks above. Cheers and good luck, sounds like you have a great rig to experiment with!
-
Where are my 136k stars 😃
-
Has something changed with how llama-quantize wants the custom quant types specified? Specifically, it gives me an error for some of them.
-
There have been no changes related to custom quants. Can you post your full command?
-
Sure! I arrived at:
It also doesn't like q6_k, but is ok with q4_0. I dug around a little, but couldn't figure out why.
-
Oh, this is Kawrakow-style usability at its best! The "K" in k-quants needs to be capitalized. So, q6_K rather than q6_k. This applies only to k-quants (which is why q4_0 works fine in lowercase).
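For example, a rule set along these lines would be accepted, whereas writing the k-quant types in lowercase would not. This is only a sketch with made-up paths and tensor regexes, and the exact syntax of the custom-quant option should be double-checked against llama-quantize --help in your build.

```bash
# Sketch only: k-quant type names need the capital K (q8_0 and other legacy
# types stay lowercase). Paths and regexes are placeholders, not a recipe.
./build/bin/llama-quantize \
    --custom-q "token_embd\.weight=q8_0,attn_.*=q6_K,ffn_down_exps.*=q5_K,ffn_.*_exps.*=q4_K" \
    /models/DeepSeek-R1-BF16-00001-of-00030.gguf \
    /models/DeepSeek-R1-CUSTOM.gguf \
    Q4_K_M
```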
-
@ubergarm I incorporated some of your suggestions and re-ran the benchmark.
I ran llama-sweep-bench with a few different configurations.
From my tests,
I prefer to run R1 instead of V3, so I currently don't have the quant to utilize more RAM. I can run benchmarks on your quants.

Benchmark results (system: 7975wx with FCLK=2100, RAM at 5600 MHz, RTX 5090):
Partial benchmark logs:

GPU: -ctk f16 -ctv f16, --override-tensor all_but_3_exps (VRAM: 30G, RAM: 216G)
./build/bin/llama-sweep-bench
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 200, n_threads = 32, n_threads_batch = 32

GPU (best so far): -ctk f16 -ctv f16, --override-tensor down_exps=CPU,gate_exps=CPU,up_exps=CPU (VRAM: 18.5G, RAM: 228G)
./build/bin/llama-sweep-bench
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 200, n_threads = 32, n_threads_batch = 32

GPU: -ctk q8_0 (VRAM: 17.5G, RAM: 228G)
./build/bin/llama-sweep-bench
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 200, n_threads = 32, n_threads_batch = 32

CPU with ctk=f16
./build/bin/llama-sweep-bench
-
This is only true when attention is computed on the GPU.
-
Hello, I have a question. I'm using a laptop 2060 and I'm trying to speed up partial offloading for Gemma 3 12B. I've compiled your build of llama.cpp with CUDA and AVX2 to see if there's any improvement compared to mainline; however, it was noticeably slower. In the readme it is mentioned that for CUDA you need to offload the token embeddings tensors to the GPU, but nowhere can I see the command to do that. I think it's --override-tensor, but I don't know the specific pattern. I tried ffn_down_exps=CUDA0, which resulted in a speedup almost on par with main, but using that together with ffn_up_exps=CUDA0,gate_exps=CUDA0 results in a performance loss again (although I think the latter is only for MoE models?). What is the command for doing that? Thank you!
-
Alright. I want to put down some baseline numbers. I've built a system with an EPYC 9175F and 768 GB @ 5600, with 2x RTX 6000 Ada Generation for 96 GB VRAM. Due to my dumb ass and inexperience with this kind of hardware, I'm running without GPUs and with the RAM configured at 3600 for the time being. Pulled down ubergarm/DeepSeek-V3-0324-IQ4_K_R4 and running it with ik_llama.cpp on master with my config flags. RTR seems to have a huge impact. Overall things are about 66% faster than mainline llama.cpp with the unsloth 4-bit quant. I'm actually okay with this TG, but I gotta get my PP up 😜; my use case requires trawling through a lot of context. I'll check back in when I get the GPUs working and the RAM at the expected speed.
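For anyone wanting a concrete starting point for a CPU-only run like this, here is a minimal sketch of the flags this thread generally converges on. The model path, thread count, and context size are placeholders, and it is not necessarily the exact configuration used in the comment above.

```bash
# Sketch of a CPU-only run. -rtr repacks tensors on the fly and can be dropped
# if the quant is already a repacked *_R4 one. Tune --threads to your cores.
./build/bin/llama-server \
    --model /models/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
    --ctx-size 32768 \
    -mla 3 -fa \
    -amb 512 \
    -fmoe \
    -rtr \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```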
-
Can you please add llama-sweep-bench to the guide?
-
Thanks for putting this guide together! I have to say ik_llama.cpp has been a great experience so far for me:
I'm already very happy with the tokens/s I'm getting from ik_llama.cpp when using DeepSeek-R1-UD-Q2_K_XL:
What I'd like to try to optimize now is the context size. Specs of the machine:
Current maximum context size I managed to get so far was 41000. Full ik_llama.cpp run arguments:
Is there any way to squeeze a larger context size out of this hardware, while maintaining reasonable tokens/s (>15 tps)? Thanks for any help and for working on this!
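Not an authoritative answer, but the context-saving knobs discussed elsewhere in this thread are MLA with flash attention, a quantized KV cache (-ctk q8_0), and capping the attention compute buffer with -amb. A sketch of the kind of invocation meant, with placeholder paths, thread count, and layer count:

```bash
# Sketch combining the context-related options from this thread:
#   -mla 3 -fa   MLA + flash attention (much smaller KV cache for DeepSeek)
#   -ctk q8_0    quantized KV cache
#   -amb 512     cap the attention compute buffer at ~512 MiB
# Paths, threads, and the context target are placeholders.
./build/bin/llama-server \
    --model /models/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --ctx-size 65536 \
    -mla 3 -fa \
    -ctk q8_0 \
    -amb 512 \
    -fmoe \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```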
-
Hi Everyone,

Great thread on the subject; it was very helpful for me to optimize the oldish hardware I currently have to play with this. I wanted to share some of the results of my experiments after reading everything here, and see if anyone has any further suggestions on how to make things faster for CPU only.

I'm using 2 Xeon Gold (Skylake) CPUs with 1TB of RAM. If I enable sub-NUMA clustering and leave interleaving disabled, the 2 CPUs present 4 NUMA nodes. With sub-NUMA clustering disabled and interleaving disabled, I get 1 node per CPU. And finally, with NUMA disabled and interleaving enabled, I get a single node for both CPUs.

Using the Intel mlc tool, the maximum bandwidth is achieved with 1 NUMA node per CPU, around 100 GB/s each. Having a single node for both CPUs gives me around 130 GB/s. In theory, going with 2 nodes should be faster, but in reality it seems like having everything consolidated under a single NUMA node is the fastest option (around 30% faster). I'm using Windows; perhaps the results would be better on Linux?

Best result I got so far:
G:\ik_llama>llama-bench.exe --model "G:\Qwen3-235B-A22B-128K-Q8_0-00001-of-00006.gguf" -mla 3 -fa 1 -t 28 --run-time-repack 1
Any suggestions are appreciated! :-)
-
What's the easiest method to produce a file that simply applies the --run-time-repack transformation to an existing GGUF? I can run DeepSeek at Q8_0, but the startup time is a killer.
-
Hi everyone,

First, I want to sincerely thank @ikawrakow for this amazing repo (definitely deserves much more attention!), and @ubergarm for his excellent guides, insights, and quants. Big appreciation also goes out to unsloth and bartowski.

I'm currently building a new AI/LLM machine. Although it's still a WIP (with some cooling issues), I couldn't resist running some tests. The final setup will run Proxmox and will have multiple GPUs, but for now it is an AMD Epyc 9355 with 768 GB RAM and a single RTX 4090 running Windows. Without much expertise, I managed to compile the library with:

cmake -B build -G Ninja ^
-DCMAKE_BUILD_TYPE=Release ^
-DLLAMA_CURL=OFF ^
-DGGML_CUDA=ON ^
-DGGML_BLAS=OFF ^
-DGGML_AVX512=ON ^
-DGGML_AVX512_VNNI=ON ^
-DGGML_AVX512_BF16=OFF ^
-DCMAKE_CUDA_ARCHITECTURES=89
cmake --build build --config Release -j $env:NUMBER_OF_PROCESSORS

Honestly, I'm unsure if I'm losing performance by disabling GGML_AVX512_BF16. With ik-llama finally running, I tested DeepSeek-V3 quants with various params, and ended up with these:
Results

Observations and Thoughts
Logs - ubergarm
Logs - unsloth
Logs - bartowski
I have NPS0 set in BIOS, and "LLC as NUMA domain (ACPI SRAT L3 Cache as NUMA domain)" ENABLED. It might be worth re-testing with this option DISABLED. I will test smaller and larger quants, too, but downloads take ages 😃. Anyway, just wanted to say "thanks" and share my excitement 💯.
-
Thank you for the kind words!
Please post the compilation errors you get with GGML_AVX512_BF16 enabled. There are places where I have added GEMM/GEMV implementations optimized for bf16, so you may be leaving some performance on the table.
-
I have a dual EPYC 9355 system which normally has 768 GB of RAM across 24 channels and scores roughly 720 GB/s memory bandwidth on the stream triad test. At the moment I have an RDIMM failure, so I'm down a stick and only have 23 channels and 736 GB of system RAM. I also have a Blackwell 6000 Pro on this system. I run with NPS4 set in the system BIOS, so I have 8 NUMA domains. I typically run DeepSeek-V3-0324 671b:Q4_K_XL, so that's the model I'll be showing benchmarks for here.

I run this before every llama server startup:

echo 0 | sudo tee /proc/sys/kernel/numa_balancing
echo 3 | sudo tee /proc/sys/vm/drop_caches

Using llama-batched-bench:

./build/bin/llama-batched-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--numa numactl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
-npp 512 -ntg 128 -npl 1
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 24.441 | 20.95 | 5.973 | 21.43 | 30.414 | 21.04 |

With llama-sweep-bench:

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--numa numactl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 3 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.862 | 132.56 | 15.186 | 8.43 |
| 512 | 128 | 512 | 3.851 | 132.94 | 15.240 | 8.40 |
| 512 | 128 | 1024 | 3.873 | 132.19 | 15.232 | 8.40 |
| 512 | 128 | 1536 | 3.925 | 130.45 | 15.253 | 8.39 |

I'm just curious: why is generation tok/s so much lower in llama-sweep-bench? Thanks!
-
I think you are observing a difference in GPU offload policy. In
which for DeepSeek-R1/V3 translates to 1024 tokens. So, basically, in this benchmark you are not using the GPU at all, everything runs on the CPU when using
will give a nice table with PP and TG performance for 0...32k tokens in the KV cache. I think in

Another comment related to the NUMA situation: I don't have access to a NUMA system myself, but people report that, sadly, on dual socket systems they get the best performance by disabling NUMA in the BIOS and running on a single CPU. @ubergarm has done quite a few experiments in that regard. I haven't followed what is happening in
-
Just to let you know guys, I did some benchmarks of ik_llama.cpp on my setup (192GB RAM + 208GB VRAM) with DeepSeek V3/R1/Chimera at Q2_K_XL, IQ3_XXS, IQ3_KS, Q3_K_XL and IQ4_XS on reddit, if you want to take a look! https://www.reddit.com/r/LocalLLaMA/comments/1lwnj5x/performance_benchmarks_on_deepseek/ The performance of ik_llama.cpp for these kinds of setups is really impressive!
-
@ikawrakow here it is with NPS0:

mla 3

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 3 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.677 | 139.23 | 12.996 | 9.85 |
| 512 | 128 | 512 | 3.994 | 128.19 | 13.160 | 9.73 |
| 512 | 128 | 1024 | 4.020 | 127.37 | 13.161 | 9.73 |
| 512 | 128 | 1536 | 4.279 | 119.65 | 13.426 | 9.53 |
| 512 | 128 | 2048 | 4.193 | 122.11 | 13.596 | 9.41 |
| 512 | 128 | 2560 | 3.868 | 132.38 | 12.987 | 9.86 |
| 512 | 128 | 3072 | 4.655 | 109.98 | 13.682 | 9.36 |
| 512 | 128 | 3584 | 4.291 | 119.31 | 13.344 | 9.59 |
| 512 | 128 | 4096 | 4.287 | 119.44 | 12.890 | 9.93 |
| 512 | 128 | 4608 | 4.221 | 121.29 | 12.835 | 9.97 |

mla 2

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--threads 32 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 2 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 32, n_threads_batch = 32
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.766 | 135.95 | 12.805 | 10.00 |
| 512 | 128 | 512 | 3.774 | 135.66 | 12.753 | 10.04 |
| 512 | 128 | 1024 | 3.833 | 133.59 | 13.051 | 9.81 |
| 512 | 128 | 1536 | 4.051 | 126.38 | 13.200 | 9.70 |
| 512 | 128 | 2048 | 3.882 | 131.89 | 13.089 | 9.78 |
| 512 | 128 | 2560 | 3.887 | 131.71 | 13.085 | 9.78 |
| 512 | 128 | 3072 | 3.993 | 128.24 | 13.275 | 9.64 |
| 512 | 128 | 3584 | 4.380 | 116.89 | 13.879 | 9.22 |
| 512 | 128 | 4096 | 4.273 | 119.82 | 13.199 | 9.70 |
| 512 | 128 | 4608 | 4.115 | 124.41 | 12.996 | 9.85 |

Doesn't seem to make much difference, mla 2 vs 3. PP speed does continue to rise past 32 threads though, which is surprising:

./build/bin/llama-sweep-bench \
--model /data/DeepSeek-V3-0324-GGUF-UD/UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00008.gguf \
--alias DeepSeek-V3-0324:671b-q4_k_xl \
--threads 61 \
--ctx-size 163840 \
--n-gpu-layers 62 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--temp 0.3 \
--min-p 0.0 \
--flash-attn \
--host 0.0.0.0 \
-mla 2 \
-fmoe \
-rtr \
--port 11434
main: n_kv_max = 163840, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 62, n_threads = 61, n_threads_batch = 61
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 3.274 | 156.36 | 12.792 | 10.01 |
| 512 | 128 | 512 | 3.174 | 161.33 | 12.924 | 9.90 |
| 512 | 128 | 1024 | 3.099 | 165.22 | 13.011 | 9.84 |
| 512 | 128 | 1536 | 3.204 | 159.83 | 13.140 | 9.74 |
| 512 | 128 | 2048 | 3.196 | 160.22 | 13.131 | 9.75 |
| 512 | 128 | 2560 | 3.093 | 165.54 | 13.327 | 9.60 |
| 512 | 128 | 3072 | 3.443 | 148.70 | 13.393 | 9.56 |
| 512 | 128 | 3584 | 3.369 | 151.97 | 13.454 | 9.51 |
| 512 | 128 | 4096 | 3.413 | 150.02 | 13.577 | 9.43 |
-
Transferring from kvcache-ai/ktransformers#1417. Short story -- I would like to switch to ik_llama.cpp from ktransformers (ktransformers is having huge stability problems). I would like to know how I can run DeepSeek R1/V3 with 128k context and more. ktransformers uses the matrix absorption trick (https://docs.flashinfer.ai/api/mla.html, https://github.com/madsys-dev/deepseekv2-profile/blob/main/workspace/blog/optimizing-mla.md) -- that is, flashinfer allows a single 24GB GPU to prefill up to 128k context (I never tried more because I didn't know DeepSeek supports 163k). So what can be done currently in my case to support large context? I have various machines, mostly with a Threadripper Pro 3995wx (incl. Lenovo-locked), overclocked Samsung ECC RAM up to 3200 MT/s, and currently up to 3 RTX 3090 FE GPUs per workstation with p2p enabled:
Currently researching what @ubergarm suggested and actually trying to fix the bug in ktransformers. Please advise what can be done. [EDIT]: Currently doing this:
It's running well on a single GPU but it's only 41k context. [EDIT2]: it seems that lots of people have trouble using flashinfer instead of flash attention. For example: https://github.com/turboderp-org/exllamav3
The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flashinfer is not available in flash attention, hence for the full context in ik_llama.cpp it's required to have at least 48 GB VRAM, which is not ideal.
-
ik_llama.cpp
Last Updated: Tue May 13 03:52:20 PM EDT 2025 (still needs more updates, can't keep up, check through comments below)
NEW: Two new custom quants great for CPU+GPU or CPU-only inferencing, fitting 32k+ context in under 24GB VRAM, here on huggingface: ubergarm/DeepSeek-V3-0324-GGUF! Or start out with the quant you already have to kick the tires on ik_llama.cpp.
tl;dr;
ik_llama.cpp is a custom fork of llama.cpp introducing many interesting optimizations for MoEs like DeepSeek-R1 671B. The new SOTA quant types can repack your existing GGUFs on the fly, or you can roll your own to maximize quality and speed for your exact system VRAM and RAM availability.

I highly recommend you give ik_llama.cpp a try, especially for CUDA+CPU or pure CPU inferencing. It has all the very similar ergonomics as the vanilla llama-server that you already know and love.

Install
Features
Quick Start
Existing DeepSeek-R1 671B GGUF
Get 64k context with a single 24GB VRAM GPU using your existing unsloth quants like unsloth/DeepSeek-R1-UD-Q2_K_XL!
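The exact command depends on your hardware, but a minimal sketch of the kind of CPU+GPU invocation meant here looks roughly like the following; the model path, thread count, and context size are placeholders to adapt to your system.

```bash
# Sketch: attention, shared experts, and embeddings go to the GPU, routed
# experts stay on CPU, and -rtr repacks the CPU tensors on the fly (this
# disables mmap). Paths and counts are placeholders.
./build/bin/llama-server \
    --model /models/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --ctx-size 65536 \
    -ctk q8_0 \
    -mla 2 -fa \
    -amb 512 \
    -fmoe \
    -rtr \
    --n-gpu-layers 63 \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```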
Custom Quant
I rolled my own custom quant to improve quality while still fitting 32k context in under 24GB VRAM. No need to use -rtr as this quant is already repacked, so you can still use mmap(), allowing you to run on systems without enough RAM by paging the disk cache. This quant has lower perplexity than UD-Q2_K_XL while only being slightly larger/slower. Good size for 256GB RAM systems where Q4_K_M doesn't fit.

Custom Quants
👇
Click here for how to make your own custom quants including repacking
☝️
Benchmarking
Test Rig
mlc memory read bandwidth:

Linux TR24 6.13.0-061300-generic #202501302155 SMP PREEMPT_DYNAMIC Sat Feb 8 09:06:55 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
llama-bench
Note: ik_llama.cpp's llama-bench doesn't seem to iterate over all variables, so fix these manually for test cases:

- -fmoe 0,1
- -rtr 0,1
- -ot (probably; I didn't test this specifically as I'm always using exps=CPU for this rig)

It does seem to iterate over variables for fa, mla, and amb.
Perplexity
Even more perplexity logs
There is a lot going on here. There may be some issues with nan and "numerical instability" depending on exact quants and llama.cpp forks in use. So this is still evolving.

I made the above png graph using the first 35 chunks for easy comparison, as generally nan didn't appear too early for most quants.

I also haven't compared perplexity across ik_llama.cpp with different settings (e.g. mla etc.) vs vanilla llama.cpp, and CPU vs CUDA backends, etc.

The following exact detailed log results are not included yet in the graph above.
Q8_0
I ran the unsloth Q8_0 on that Intel 6980P CPU-only backend with vanilla llama.cpp/main@b1b132ef for a baseline. Note there is no MLA etc. yet in this case.

ubergarm Q2_K_R4
This is a custom quant I rolled with q8_0 for all attention/shared experts/embeddings loaded on GPU. The rest of the MoE down exps are q3_k_r4 and gate/up exps are q2_k_r4, which gives a fast quant that fits nicely into under 256GB RAM and 24GB VRAM with about 32k context without sacrificing much perplexity.

This was run on ik_llama.cpp@127c6ee6

ubergarm Q2_K_R4 with various -ser N,1
Testing the same quant and config as above but with -ser 4,1 etc. to get a feel for quality vs speed tradeoffs.

These were run on ik_llama.cpp@127c6ee6

ubergarm IQ2_BN_R4
This is an experimental quant I rolled with q8_0 for all attention/shared experts/embeddings loaded on GPU. The rest of the MoE down exps are iq2_xs_r4 and gate/up exps are iq2_bn_r4. However, perplexity looks pretty bad, so I'll likely aim for a larger model with higher quality quants and make up the speed/accuracy trade-off by exploring -ser instead of going with very small quants.

Looking back on it with advice from the team: bitnet quants are very fast to compute, but only good quality for models trained specifically as a ternary bit-net. So this is not the correct use-case.

This was run on ik_llama.cpp@127c6ee6

ubergarm IQ2_K_R4
Another experimental quant with q8_0 for all GPU layers (with room for 32k context still), and down=iq3_k_r4 and gate/up=iq2_k_r4 for the -ot exps=CPU CPU offload.

Debugging Crashes

Usually no need to do this, as any asserts will print the line number directly.
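If you ever do need more than the assert's line number, a debug build plus a gdb backtrace is the usual route; a generic sketch (binary name, model path, and flags are placeholders):

```bash
# Sketch: rebuild with debug info, then reproduce the crash under gdb and
# grab a backtrace.
cmake -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGGML_CUDA=ON
cmake --build build -j "$(nproc)"

gdb --args ./build/bin/llama-server --model /models/your-model.gguf -mla 3 -fa
# inside gdb:
#   (gdb) run
#   (gdb) bt    # print the backtrace once it crashes
```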
TODO
References