Added kimi-k2 support (ported from llama.cpp) #609
Conversation
Original patch by @gabriellarson: ggml-org/llama.cpp#14654
I see this warning when loading the model |
Thanks for using your resources (both CPU and BRAIN) for hacking on this behemoth model! I've successfully used the mainline PR version of convert_hf_to_gguf.py on the bf16 safetensors created by the deepseek fp8_cast_to_bf16.py script, and the resulting Q8_0 seems to be working. I'll try to use this PR on the same bf16 safetensors, and hope that the MLA stuff works out and that I don't get that |
I haven't ported the python changes yet, just getting ik_llama to load the model. |
LGTM.
@ikawrakow sorry, I forgot to mark this as a draft. Still waiting for llama.cpp branch to merge... |
I'll open a follow-up PR to bring over any changes as well as port the python script support. |
Okay, yeah, I was thinking this might happen, as I'd seen it trying to use the "mainline method" instead of the OG fairydreaming/evshiron method to preserve the tensors. That warning is because the "mainline method" handles some MLA tensors differently. I always use the evshiron method for my ik-specific quants, so we might need to look into the differences between what you have ported and https://github.com/evshiron/llama.cpp. @saood06 and I have been discussing that it'd be great to get this all into ik's fork. |
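For anyone trying to tell which conversion path produced a given GGUF, one quick check is to list the attention tensor names with the `gguf` Python package. This is a rough sketch assuming the two conversion paths differ in which MLA-related tensors (e.g. separate `attn_k_b`/`attn_v_b` vs. a combined `attn_kv_b`) end up in the file; the exact substrings and the file path are illustrative assumptions, not something confirmed in this thread.

```python
# Rough sketch: list attention tensor names in a GGUF to see which MLA layout it carries.
# The substrings checked below (attn_k_b / attn_v_b / attn_kv_b) are assumptions about
# how the conversion paths differ -- verify against your own files.
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("Kimi-K2-Instruct-Q8_0.gguf")  # hypothetical path

names = [t.name for t in reader.tensors]
mla_like = sorted(
    n for n in names
    if "attn_k_b" in n or "attn_v_b" in n or "attn_kv_b" in n
)

# Print a handful of matches from the first layers to see which layout is present.
for name in mla_like[:8]:
    print(name)
```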
@ubergarm I used unsloth's BF16 safetensors and then converted that to GGUF using llama.cpp, so I skipped the step that gives you the I quantized using unpatched ik_llama, and it seems to be working. |
Okay, then I think my path forward looks something like:
|
The conversion code is currently missing (this was a draft PR, I did not expect it to get merged so fast) |
Ahh okie, things are indeed moving fast. I'm reading up on some more clues from ik here so it might be okay. I'll just use my existing bf16 GGUF then and try it out on ik_llama.cpp and confirm the default behavior is Exciting monday lol 😅 |
It's OK. You can make a separate PR for the Python stuff. In the meantime if someone is really desperate to try the model with |
It'd sure be interesting if someone released a Kimi-K2-Instruct-1000B-A32B-IQ2_KL... |
That is YOUR job 😅 ... I'm sticking to q4+ quants with no imatrix. But not many have enough RAM to run those. My system is using 690G with the DQ4_K quant. |
So yeah, I tested this PR too using a "mainline style" Q8_0 I cooked, and it is running at least single inference:
Despite quantizing my bf16 GGUF with ik_llama.cpp, it still throws that warning, so something important is happening differently in convert_hf_to_gguf.py between ik_llama.cpp's version and mainline's version. So I'm fussing to see if I can merge in just the changes needed from gabriellarson/llama.cpp/tree/kimi-k2 without messing up the MLA tensors so they stay the OG way... Then I will have a bf16 GGUF with the OG-style MLA tensors and can go forward like normal haha... |
@ubergarm I see the following message when running with ik_llama, is this the same issue you are looking at?
|
Yo, guys, seriously, just had to comment on this model on two fronts. First, the model is 1 trillion parameters, and you already have to deal with 2TB of BF16 files. You could look at the DFloat11 format and compress the mantissa to ~11.2 bpw losslessly, if only for the SSD savings. I was begging ik to consider working with FP8/FP4 formats in another thread and got rejected. Why go through FP8 -> 2TB BF16 safetensors with triton-cpu -> lossy Q8_0 -> requantize to 2-3 bits, when FP4 checkpoints are out there at 580GB (k-l-lambda/Kimi-K2-Instruct-FP4 or baseten/Kimi-K2-Instruct-FP4)? I know FP8/FP4 is a lot to implement. vLLM already has a Marlin FP4 kernel, and SGLang has a petit-nvfp4 WIP kernel for ROCm. What's missing is CPU-based NVFP4/FP8 inference using a BF16 recast. Really, you'd be working with 580GB of weights already done for you. Second comment is on the Kimi K2 model itself. If you haven't read the README, it scores only 51 on SWE-bench Verified non-agentic, below R1-0528's 57 points. It gets 65 for single-agent, but then you have to use tooling, which includes bash ("Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools"). So if you want a SWE-bench score 8 points higher than R1-0528, you have to expose your bash prompt. Who knows which HTTPS API endpoints the bash tool is calling and where it's posting your data? It is such a security risk; are you going to sandbox your bash execution? All I can speculate is that you could theoretically call the Anthropic API endpoint to fudge the benchmark. Then there are the 71 points for multi-agent SWE-bench (aka cons=32 or 64). Good luck running 10 tok/s on a 768GB DDR5 EPYC at cons=64; you could sleep all night and come back in the morning for a cons=64 job. Not that impressive a 1-trillion model if you care about data security or claimed performance. I suggest you either wait for OpenAI's open-source model, which calls O3 via HTTP, or just pay $30/month for grok4-coder at cons=1 and SWE-bench=72. |
Like I said on HF, if you take the ~2 TB BF16 safetensor you made, then you can just use the |
I think I got it going now: #601 (comment) You'll have to download the ~1TB FP8 yourself and fp8_cast_bf16 them like I show in that HF repo discussion. And if my current test works, I'll open a PR with the updated ik_llama.cpp convert_hf_to_gguf.py including the Kimi-K2 fixes. (Or I could upload the 2TB bf16 with the correct MLA tensors, but would have to check if that is okay with the uplink data first... haha... 😅 ) If you start with unsloth's bf16, they already have the mainline MLA stuff done to them. |
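For anyone following along, here is a minimal sketch of the workflow being described: upcast the ~1TB FP8 safetensors to BF16 with the DeepSeek cast script, convert to GGUF with ik_llama.cpp's convert script, then quantize. The paths, flags, and binary names are assumptions based on this thread; the linked HF discussion has the authoritative steps.

```python
# Minimal sketch of the FP8 -> BF16 -> GGUF -> Q8_0 path discussed above.
# All paths and the quantize binary name are assumptions -- adjust to your setup.
import subprocess

FP8_DIR   = "Kimi-K2-Instruct"           # original ~1TB FP8 safetensors
BF16_DIR  = "Kimi-K2-Instruct-bf16"      # ~2TB BF16 output of the cast script
GGUF_BF16 = "Kimi-K2-Instruct-BF16.gguf"
GGUF_Q8   = "Kimi-K2-Instruct-Q8_0.gguf"

# 1. Upcast FP8 -> BF16 with DeepSeek's cast script (needs triton or triton-cpu).
subprocess.run(
    ["python", "fp8_cast_bf16.py",
     "--input-fp8-hf-path", FP8_DIR,
     "--output-bf16-hf-path", BF16_DIR],
    check=True,
)

# 2. Convert BF16 safetensors to GGUF with the ik_llama.cpp convert script
#    (the branch that keeps the original MLA tensor layout).
subprocess.run(
    ["python", "convert_hf_to_gguf.py", BF16_DIR,
     "--outfile", GGUF_BF16, "--outtype", "bf16"],
    check=True,
)

# 3. Quantize to Q8_0 (or any IQK type); the binary name may differ per build.
subprocess.run(["./llama-quantize", GGUF_BF16, GGUF_Q8, "q8_0"], check=True)
```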
But where is the fun in that? ;p And besides, I generally don't use LLMs, I just enjoy making them go brrr... |
Do we feed the trolls? 🤔 |
Kimi-K2 has amazing VRAM savings, I can load the full 131k context! I am over the moon with this model 🌚 |
I haven't tried the model at all, but I have heard mixed feedback about it. If you don't mind, how prone to refusals is it? That's the one area I'm most curious about (and will probably affect when/whether or not I end up trying the model locally). |
Why do you call me a troll? That's just not nice. I am being realistic. What's the point of running DQ4KM at 690GB, or at IQ2K/IQ3K levels that further drop SWE-bench, if you use it for real work? It took me about a year of messing with GGUF to realize that the GGUF format, even with IK's superb IQK quants, is such a toy for client-side home production, and I am forced to move to the original FP4 safetensors format instead, or just pay for the top-tier models. GGUF got started too early. There's a BF16 -> IQ6K compression saving, even at FP8 -> IQ6K, but the compression just disappears when GB200 trains FP4 models natively; no one is dumb enough to run an FP4-trained/compressed model at IQ6K. |
Thanks, I'll keep an eye on it. But so far it's been amazing at answering my usual benchmark questions. I'll try my go-to roo-code project to see how well it does. In terms of refusals, I noticed devstral-small-24b was refusing some of my suggestions. I suspect it's related to agentic lean, where models are taught to avoid uncertain actions to prevent getting into the weeds. Since Kimi-K2 is mainly developed for agentic use, it may have similar tendencies. |
Never heard the term "agentic lean" before. If you are just using it for coding tasks, then I'm not sure you will hit the refusals I care about. It's not even the refusals themselves that I care about, as bypassing them is rather trivial, but their existence and prevalence tend to correlate with training decisions that impact downstream quality, which is what I do care about. (Never refusing, like abliterated models, leads to worse quality from what I've seen, just like a model that refuses too often.) |
Sorry, that sounds like something a tech bro would say. Perhaps I was primed somehow 😅. Just sharing my thoughts that these models were both trained for agentic use-cases, so they may share similar tendencies. |
Not calling you out, just was new vocabulary for me.
That does make sense. I do appreciate your thoughts, no need to apologize. |
Sorry I'm taking so long, still testing that my convert_hf_to_gguf.py is working; it's taking a while and I had to restart for hardware stuff, hah... It is just the mainline changes for kimidev applied to the existing ik_llama.cpp fork's convert_hf_to_gguf.py - no need for the evshiron fork technically (though it is convenient for saving a step and disk space, but that's outside the scope for me). The mainline PR still has some discussion going, and from what I heard in BeaverAIClub, the chat template looks like this (with no newlines) (credit to tofumagnate for this info), converted from the official template: https://huggingface.co/moonshotai/Kimi-K2-Base/blob/main/tokenizer_config.json#L154
So we're probably gonna need something around here: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L23236-L23259 for the chat completions endpoint to detect it and apply it on the server side... UPDATE: Anyway, if this thing finally finishes I can get a Q8_0 that should not have the warning on this fork! What a day lol |
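Since the template text itself didn't make it into this thread, here is a hedged Python sketch of what applying a role-token template of this general shape looks like, e.g. when hitting the plain completion endpoint before the server-side detection lands. The exact marker strings are my reading of the linked tokenizer_config.json and should be verified there rather than taken from this sketch.

```python
# Hedged sketch: build a Kimi-K2-style prompt by hand for the /completion endpoint.
# The role-marker tokens are assumptions based on the linked tokenizer_config.json.

ROLE_PREFIX = {
    "system": "<|im_system|>system<|im_middle|>",
    "user": "<|im_user|>user<|im_middle|>",
    "assistant": "<|im_assistant|>assistant<|im_middle|>",
}
TURN_END = "<|im_end|>"

def build_prompt(messages, add_generation_prompt=True):
    """Flatten a list of {'role': ..., 'content': ...} dicts into one prompt string."""
    parts = [ROLE_PREFIX[m["role"]] + m["content"] + TURN_END for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model starts responding (likely what the
        # "add ass to template" change below refers to).
        parts.append(ROLE_PREFIX["assistant"])
    return "".join(parts)

if __name__ == "__main__":
    print(build_prompt([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]))
```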
Is that a Discord?
I never connected the dots that the chat completion endpoint needs that (probably because I prefer and almost always use the standard completion endpoint). Thanks. |
moonshotai/Kimi-K2-Instruct ikawrakow#609 (comment)
I finally think I'm out of the woods with the convert script... My tmux was dying, which would end the process, so I had to run it in a nohup lol... I think it's ... Anyway, in the meantime I pushed a branch, but I want to test that it is working with a quant. I also added what I think will be the chat template, which also needs testing. I could open a draft PR I suppose, at least to have a placeholder... https://github.com/ubergarm/ik_llama.cpp/tree/ug/convert-kimi-k2 One step closer! UPDATE: Went ahead and opened a draft PR #612 |
* convert_hf_to_gguf for Kimi-K2-Instruct: adapt mainline `PR14653` for the tokenizer while maintaining proper MLA tensors. Tested with this workflow using deepseek fp8_cast_bf16.py and triton-cpu to upcast the fp8 safetensors to bf16 safetensors, then used this convert_hf_to_gguf.
* Add Kimi-K2 chat template moonshotai/Kimi-K2-Instruct #609 (comment)
* kimi-k2: add ass to template to get response
Ported kimi-k2 support from llama.cpp.
Original patch by @gabriellarson