kimi-k2 convert script and chat template #612
Conversation
Adapts mainline `PR14653` for the tokenizer while maintaining the proper MLA tensors. Tested with this workflow: used DeepSeek's fp8_cast_bf16.py with triton-cpu to upcast the fp8 safetensors to bf16 safetensors, then ran this convert_hf_to_gguf.py.
moonshotai/Kimi-K2-Instruct ikawrakow#609 (comment)
Okay, just got the Q8_0 started up and it seems coherent in short inferences. Also, with this PR it does detect the chat template as such now:
Gonna let this imatrix run and get some sleep. I added specifically
Thanks! |
Thanks! Continuing testing this morning; rolled the first test quant. Also updated the chat template a bit, as moonshot seems to have added carriage returns overnight: https://huggingface.co/moonshotai/Kimi-K2-Instruct/blob/main/tokenizer_config.json#L154
👈 Recipe Details
# Quantizing MLA Notes
# https://github.com/ikawrakow/ik_llama.cpp/issues/601#issuecomment-3070185792
# [0,60] Layers
# First Layer has dense ffn_(gate|up|down)
# Remaining layers have 384x exps and 1x shexp
# token_embd.weight - [ 7168, 163840, 1, 1], type = bf16, converting to q8_0 .. size = 2240.00 MiB -> 1190.00 MiB
# blk.0.ffn_down.weight - [18432, 7168, 1, 1], type = bf16, converting to q8_0 .. size = 252.00 MiB -> 133.88 MiB
# blk.0.ffn_gate.weight - [ 7168, 18432, 1, 1], type = bf16, converting to q8_0 .. size = 252.00 MiB -> 133.88 MiB
# blk.0.ffn_up.weight - [ 7168, 18432, 1, 1], type = bf16, converting to q8_0 .. size = 252.00 MiB -> 133.88 MiB
# blk.0.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
# blk.0.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
# blk.0.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
# blk.0.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
# blk.0.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = bf16, converting to q8_0 .. size = 7.88 MiB -> 4.18 MiB
# blk.0.attn_kv_b.weight - [ 512, 16384, 1, 1], type = bf16, converting to q8_0 .. size = 16.00 MiB -> 8.50 MiB
# blk.0.attn_k_b.weight - [ 128, 32768, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
# blk.0.attn_v_b.weight - [ 512, 8192, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
# blk.0.attn_output.weight - [ 8192, 7168, 1, 1], type = bf16, converting to q8_0 .. size = 112.00 MiB -> 59.50 MiB
# blk.0.attn_q_a.weight - [ 7168, 1536, 1, 1], type = bf16, converting to q8_0 .. size = 21.00 MiB -> 11.16 MiB
# blk.0.attn_q_b.weight - [ 1536, 12288, 1, 1], type = bf16, converting to q8_0 .. size = 36.00 MiB -> 19.12 MiB
# blk.9.ffn_down_exps.weight - [ 2048, 7168, 384, 1], type = bf16, converting to q8_0 .. size = 10752.00 MiB -> 5712.00 MiB
# blk.9.ffn_gate_exps.weight - [ 7168, 2048, 384, 1], type = bf16, converting to q8_0 .. size = 10752.00 MiB -> 5712.00 MiB
# blk.9.ffn_up_exps.weight - [ 7168, 2048, 384, 1], type = bf16, converting to q8_0 .. size = 10752.00 MiB -> 5712.00 MiB
# blk.9.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
# blk.9.exp_probs_b.bias - [ 384, 1, 1, 1], type = f32, size = 0.001 MB
# blk.9.ffn_gate_inp.weight - [ 7168, 384, 1, 1], type = f32, size = 10.500 MB
# blk.9.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
# blk.9.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
# blk.9.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
# blk.9.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = bf16, converting to q8_0 .. size = 28.00 MiB -> 14.88 MiB
# blk.9.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 28.00 MiB -> 14.88 MiB
# blk.9.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = bf16, converting to q8_0 .. size = 28.00 MiB -> 14.88 MiB
# blk.9.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = bf16, converting to q8_0 .. size = 7.88 MiB -> 4.18 MiB
# blk.9.attn_kv_b.weight - [ 512, 16384, 1, 1], type = bf16, converting to q8_0 .. size = 16.00 MiB -> 8.50 MiB
# blk.9.attn_k_b.weight - [ 128, 32768, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
# blk.9.attn_v_b.weight - [ 512, 8192, 1, 1], type = bf16, converting to q8_0 .. size = 8.00 MiB -> 4.25 MiB
# blk.9.attn_output.weight - [ 8192, 7168, 1, 1], type = bf16, converting to q8_0 .. size = 112.00 MiB -> 59.50 MiB
# blk.9.attn_q_a.weight - [ 7168, 1536, 1, 1], type = bf16, converting to q8_0 .. size = 21.00 MiB -> 11.16 MiB
# blk.9.attn_q_b.weight - [ 1536, 12288, 1, 1], type = bf16, converting to q8_0 .. size = 36.00 MiB -> 19.12 MiB
# output.weight - [ 7168, 163840, 1, 1], type = bf16, converting to q8_0 .. size = 2240.00 MiB -> 1190.00 MiB
#!/usr/bin/env bash
custom="
## Attention [0-60] (GPU)
# Only ik's fork uses this; keep it q8_0 as it's only used for PP with -mla 3
blk\..*\.attn_kv_b\.weight=q8_0
# Ideally k_b and v_b are smaller than q8_0, as they are used for TG with -mla 3 (and ik's imatrix supports it)
# blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0 or iq4_nl
blk\..*\.attn_k_b\.weight=q5_0
# Balance of attn tensors
blk\..*\.attn_.*=iq5_ks
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
## Shared Expert (1-60) (GPU)
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks
## Routed Experts (1-60) (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
## Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x15B-Instruct-safetensors-BF16-00001-of-00045.gguf \
/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-IQ2_KL.gguf \
IQ2_KL \
192

Currently testing perplexity to make sure it runs clean. Also working with the AIBeaverClub folks to test the API endpoint, and having some kind of issue. The model will sometimes reply okay, but other times it takes a little while and returns an empty response, and the server logs show really high TG when it happens:
But then other times it does respond okay: well formatted, coherent... So hoping it's maybe just the chat template that is off; I'll hack on it some more before marking this ready.

EDIT

$ python chat_template_tester.py moonshotai/Kimi-K2-Instruct
>> chat template <<
<|im_system|>system<|im_middle|>example system prompt<|im_end|><|im_user|>user<|im_middle|>example user turn 1<|im_end|><|im_assistant|>assistant<|im_middle|>example assistant turn 1<|im_end|><|im_user|>user<|im_middle|>example user turn 2<|im_end|><|im_assistant|>assistant<|im_middle|>
>> end of chat template <<

No pressure, but happy to hear if you manage to use this convert script on the original fp8 safetensors to get your good MLA bf16 GGUFs (with the attn_kv_b tensor). |
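For reference, a minimal sketch of what a checker like chat_template_tester.py might do (this is an assumption, not the actual script; it just leans on the Hugging Face transformers API to render the template the same way):

#!/usr/bin/env python3
# Sketch of a chat-template checker (not the actual chat_template_tester.py).
import sys
from transformers import AutoTokenizer

model_id = sys.argv[1] if len(sys.argv) > 1 else "moonshotai/Kimi-K2-Instruct"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "example system prompt"},
    {"role": "user", "content": "example user turn 1"},
    {"role": "assistant", "content": "example assistant turn 1"},
    {"role": "user", "content": "example user turn 2"},
]

rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(">> chat template <<")
print(rendered)
print(">> end of chat template <<")
# Flag the carriage returns moonshot added to tokenizer_config.json.
print("contains carriage returns:", "\r" in rendered)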
@ubergarm I can test the |
Oh, I didn't realize they uploaded the bf16 safetensors; that must be just the output of fp8_cast_bf16.py. Yes, that should work, as that step does not strip the

So far so good, the updated chat template |
How quickly, or rather how slowly, does it go? |
Btw., I have decided to add a sub-2 bpw quant, |
I hope to get some sweep-benches in eventually; anecdotally, on short prompts with llama-server I'm seeing around 130~150 tok/sec PP and 10~12 tok/sec TG running CPU-only on a single socket of a

Running like so on a single socket. I haven't found the sweet spot for threads given this rig is new to me.

model=/mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KL/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf
numactl -N 0 -m 0 \
./build/bin/llama-server \
--model "$model"\
--alias ubergarm/Kimi-K2-Instruct \
--ctx-size 32768 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
--parallel 1 \
--threads 64 \
--threads-batch 192 \
--numa numactl \
--host 127.0.0.1 \
--port 8080 |
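For poking at the API endpoint mentioned above, here is a minimal request sketch. It assumes the server exposes the OpenAI-compatible /v1/chat/completions route on the host/port from the command; the alias and prompt are illustrative:

#!/usr/bin/env python3
# Quick sanity check against the llama-server instance launched above.
# Assumes an OpenAI-compatible /v1/chat/completions route; host, port, and
# model alias match the command above, the prompt is just an example.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "ubergarm/Kimi-K2-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])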
Okay, perplexity ran clean on the CPU-only implementation:
Happy to merge this now; the model will land on Hugging Face in 10 minutes. |
@ubergarm I don't see the |
Thanks for giving it a try, at least it sounds like this

👈 gguf dump of my bf16 GGUFs

$ python ./gguf-py/scripts/gguf_dump.py /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x15B-Instruct-safetensors-BF16-00001-of-00045.gguf
INFO:gguf-dump:* Loading: /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x15B-Instruct-safetensors-BF16-00001-of-00045.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 48 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 36
3: UINT64 | 1 | GGUF.kv_count = 45
4: STRING | 1 | general.architecture = 'deepseek2'
5: STRING | 1 | general.type = 'model'
6: STRING | 1 | general.name = 'Kimi K2 Instruct Bf16 Safetensors'
7: STRING | 1 | general.finetune = 'Instruct-safetensors'
8: STRING | 1 | general.basename = 'Kimi-K2'
9: STRING | 1 | general.size_label = '384x15B'
.
.
.
* Dumping 36 tensor(s)
1: 1174405120 | 7168, 163840, 1, 1 | BF16 | token_embd.weight
2: 7168 | 7168, 1, 1, 1 | F32 | blk.0.attn_norm.weight
3: 132120576 | 18432, 7168, 1, 1 | BF16 | blk.0.ffn_down.weight
4: 132120576 | 7168, 18432, 1, 1 | BF16 | blk.0.ffn_gate.weight
5: 132120576 | 7168, 18432, 1, 1 | BF16 | blk.0.ffn_up.weight
6: 7168 | 7168, 1, 1, 1 | F32 | blk.0.ffn_norm.weight
7: 512 | 512, 1, 1, 1 | F32 | blk.0.attn_kv_a_norm.weight
8: 4128768 | 7168, 576, 1, 1 | BF16 | blk.0.attn_kv_a_mqa.weight
9: 8388608 | 512, 16384, 1, 1 | BF16 | blk.0.attn_kv_b.weight
10: 4194304 | 128, 32768, 1, 1 | BF16 | blk.0.attn_k_b.weight
11: 4194304 | 512, 8192, 1, 1 | BF16 | blk.0.attn_v_b.weight
12: 58720256 | 8192, 7168, 1, 1 | BF16 | blk.0.attn_output.weight
13: 1536 | 1536, 1, 1, 1 | F32 | blk.0.attn_q_a_norm.weight
14: 11010048 | 7168, 1536, 1, 1 | BF16 | blk.0.attn_q_a.weight
15: 18874368 | 1536, 12288, 1, 1 | BF16 | blk.0.attn_q_b.weight
16: 7168 | 7168, 1, 1, 1 | F32 | blk.9.attn_norm.weight
17: 5637144576 | 2048, 7168, 384, 1 | BF16 | blk.9.ffn_down_exps.weight
18: 5637144576 | 7168, 2048, 384, 1 | BF16 | blk.9.ffn_gate_exps.weight
19: 5637144576 | 7168, 2048, 384, 1 | BF16 | blk.9.ffn_up_exps.weight
20: 384 | 384, 1, 1, 1 | F32 | blk.9.exp_probs_b.bias
21: 2752512 | 7168, 384, 1, 1 | F32 | blk.9.ffn_gate_inp.weight
22: 14680064 | 2048, 7168, 1, 1 | BF16 | blk.9.ffn_down_shexp.weight
23: 14680064 | 7168, 2048, 1, 1 | BF16 | blk.9.ffn_gate_shexp.weight
24: 14680064 | 7168, 2048, 1, 1 | BF16 | blk.9.ffn_up_shexp.weight
25: 7168 | 7168, 1, 1, 1 | F32 | blk.9.ffn_norm.weight
26: 512 | 512, 1, 1, 1 | F32 | blk.9.attn_kv_a_norm.weight
27: 4128768 | 7168, 576, 1, 1 | BF16 | blk.9.attn_kv_a_mqa.weight
28: 8388608 | 512, 16384, 1, 1 | BF16 | blk.9.attn_kv_b.weight
29: 4194304 | 128, 32768, 1, 1 | BF16 | blk.9.attn_k_b.weight
30: 4194304 | 512, 8192, 1, 1 | BF16 | blk.9.attn_v_b.weight
31: 58720256 | 8192, 7168, 1, 1 | BF16 | blk.9.attn_output.weight
32: 1536 | 1536, 1, 1, 1 | F32 | blk.9.attn_q_a_norm.weight
33: 11010048 | 7168, 1536, 1, 1 | BF16 | blk.9.attn_q_a.weight
34: 18874368 | 1536, 12288, 1, 1 | BF16 | blk.9.attn_q_b.weight
.
.
.

TODO: find a safetensor viewer... |
@ubergarm I haven't run it on your branch. What I'm saying is that this quant, created from unsloth's BF16 safetensors and converted to GGUF using llama.cpp, does not have |
☝️ That is the step which I believe munges up and omits the attn_kv_b tensors.

If you use the freshly merged ik_llama.cpp/convert_hf_to_gguf.py on those bf16 safetensors, I believe you will get the attn_kv_b tensors in your bf16 GGUF. AFAIK, upcasting from fp8 safetensors to bf16 safetensors via fp8_cast_bf16.py does not mess with the actual tensors.
Unless they did something strange, I believe they should be okay to use with this new convert script. Probably easy enough to test if you have the disk space, as nothing more is required to download.

EDIT: I believe this is the code that is munging it up in mainline convert_hf_to_gguf.py. |
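For context, here is a rough numpy sketch of how the combined attn_kv_b weight relates to the split attn_k_b / attn_v_b tensors, using the head count and dimensions visible in the gguf dump above. This is a shape-level illustration, not the convert script's actual code; note that GGUF prints dimensions innermost-first, so the numpy shapes below are reversed relative to the dump:

#!/usr/bin/env python3
# Shape-level illustration of the attn_kv_b -> attn_k_b / attn_v_b split.
# Dimensions are taken from the gguf dump above (deepseek2-style MLA).
import numpy as np

n_head, kv_lora_rank = 64, 512
qk_nope_head_dim, v_head_dim = 128, 128

# Combined decompression weight: one (nope + v) block per head over kv_lora_rank.
kv_b = np.zeros((n_head * (qk_nope_head_dim + v_head_dim), kv_lora_rank), dtype=np.float32)

kv = kv_b.reshape(n_head, qk_nope_head_dim + v_head_dim, kv_lora_rank)
k_nope = kv[:, :qk_nope_head_dim, :]   # (64, 128, 512)
v = kv[:, qk_nope_head_dim:, :]        # (64, 128, 512)

# attn_k_b is stored transposed per head so TG can multiply against the latent KV cache.
k_b = k_nope.transpose(0, 2, 1).reshape(n_head * kv_lora_rank, qk_nope_head_dim)  # (32768, 128)
v_b = v.reshape(n_head * v_head_dim, kv_lora_rank)                                # (8192, 512)

print(kv_b.shape, k_b.shape, v_b.shape)
# -> (16384, 512) (32768, 128) (8192, 512), i.e. GGUF [512,16384], [128,32768], [512,8192]

The point of this PR's convert path is to keep all three: mainline only emits the split pair, while ik's fork also wants attn_kv_b for the -mla 3 prompt-processing path.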
HF has one built in just like for GGUF. |
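For a quick local look, the safetensors library can also list tensor names and shapes straight from a shard's header without loading the weights (a sketch; the shard path is illustrative):

#!/usr/bin/env python3
# List tensor names and shapes from a .safetensors shard without loading weights.
# The shard path is illustrative; point it at one of the bf16 shards.
from safetensors import safe_open

path = "model-00001-of-00045.safetensors"
with safe_open(path, framework="pt", device="cpu") as f:
    for name in f.keys():
        print(name, f.get_slice(name).get_shape())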
I finally got to some sweep benches feeling out this big dual-socket AMD EPYC 9965 192-core rig in NPS1 with ~768GB RAM per socket. mlc clocks it at around 256GiB/s RAM bandwidth per socket. The "smaller" Kimi-K2-Instruct quants will fit on a single socket. Given I believe this is Zen5, I tried out #610 and did see around an 8% boost in PP with that AVX512 kernel. Also increasing

[sweep-bench results plot]

👈 Command and Data

# IQ2_KL 345.687 GiB (2.892 BPW)
model=/mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KL/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf
numactl -N 0 -m 0 \
./build/bin/llama-sweep-bench \
--model "$model"\
--ctx-size 12288 \
-ctk q8_0 \
-fa -fmoe \
-mla 3 \
--threads 128 \
--threads-batch 192 \
-ub 4096 -b 4096 \
--no-mmap \
--numa numactl \
--warmup-batch

IQ2_KL --no-mmap -ub 512 -b 2048
IQ2_KL -rtr -ub 512 -b 2048
IQ2_KL --no-mmap -ub 4096 -b 4096
IQ2_KL -rtr -ub 4096 -b 4096
IQ2_KL PR610 ik/q8_k_r8_avx512 --no-mmap -ub 4096 -b 4096
IQ4_KS PR610 ik/q8_k_r8_avx512 --no-mmap -ub 4096 -b 4096
I compared the larger IQ4_KS 550.428 GiB (4.604 BPW) and its performance is remarkably similar.

[sweep-bench comparison plot] |
@ubergarm looks like you're missing an indent:
I fixed it locally, so it can run overnight. |
Thanks @anikifoss, I opened PR #617 here with the fixup; let us know how it looks in the morning! |
Done:
HDDs are not fast 🙄 |
I'll quantize overnight and will let you know how it works tomorrow. |
@ubergarm quantized the converted GGUF to Q4_K for down_exps and Q3_K for the other exps. It runs, and was able to produce the spinning hexagon within 3 tries (the Q4/Q3 mix is just under 512GB, but noticeably worse than Q6/Q4). |
Marking this a draft for now. I'm about done with testing the convert script after getting sidetracked with an unrelated technical issue. Then I can roll a Q8_0, do the imatrix, and make some small enough quants to test the chat template better.
The workflow for converting Kimi-K2-Instruct is roughly documented here: https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF/discussions/1#68746feb3c3f2a7b1e8541ff
UPDATE
My first convert_hf_to_gguf.py run just finished, and I'm cooking the first Q8_0, which seems to have the proper tensors to support fast MLA:
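A quick way to confirm the converted GGUF actually carries the MLA tensors (attn_kv_b plus the split attn_k_b / attn_v_b) is to scan the tensor names with gguf-py. A sketch, assuming the gguf Python package from this repo is installed or on PYTHONPATH:

#!/usr/bin/env python3
# Sketch: confirm attn_kv_b plus the split attn_k_b / attn_v_b made it into the GGUF.
# Pass the (first shard of the) converted GGUF as the only argument.
import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])
names = [t.name for t in reader.tensors]
for suffix in ("attn_kv_b.weight", "attn_k_b.weight", "attn_v_b.weight"):
    count = sum(name.endswith(suffix) for name in names)
    print(f"{suffix}: found in {count} layer(s)")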