KVSplit for llamafile CLI options patch to make LLM inference on Apple silicon 3X faster #774
creativeautomaton started this conversation in Ideas
Replies: 1 comment
- I'm open to pulling this in if there's a PR and it is well tested
To patch a llamafile (a single-file distribution of llama.cpp) to use KVSplit from the [KVSplit GitHub][1] repository, you need to integrate differentiated key/value quantization for the KV cache, allowing, for example, 8-bit keys and 4-bit values. Below is a step-by-step guide and a sample patch outline.

1. Preparation
Review the relevant parts of llama.cpp, especially the attention/KV cache handling code.

2. Patch Outline
a. Add CLI Flags
Add support for the following flags in your CLI argument parsing code (a parsing sketch follows the list):
- --kvq-key N (bits for keys)
- --kvq-val N (bits for values)
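
A minimal parsing sketch, assuming a llama.cpp-style argument loop and two new fields (kvq_key_bits, kvq_val_bits) added to the params struct; the names here are illustrative, not llamafile's actual API:

```cpp
#include <cstdlib>
#include <string>

// Illustrative params holder; in llamafile/llama.cpp these would be new
// fields on the existing common params struct.
struct kvq_params {
    int kvq_key_bits = 8; // bits for keys   (K8 default)
    int kvq_val_bits = 4; // bits for values (V4 default)
};

// Returns true if argv[i] was a KVSplit flag and was consumed.
static bool parse_kvq_flag(int argc, char ** argv, int & i, kvq_params & p) {
    const std::string arg = argv[i];
    if (arg == "--kvq-key" && i + 1 < argc) {
        p.kvq_key_bits = std::atoi(argv[++i]);
        return true;
    }
    if (arg == "--kvq-val" && i + 1 < argc) {
        p.kvq_val_bits = std::atoi(argv[++i]);
        return true;
    }
    return false;
}
```
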
b. Modify KV Cache Data Structures
Update the KV cache struct to store keys and values with separate quantization.
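
For orientation, upstream llama.cpp already exposes separate type_k and type_v storage types for its KV cache, and a split of that kind is what KVSplit exploits. A simplified sketch of the idea (not llamafile's actual struct) might look like:

```cpp
#include "ggml.h"   // for ggml_type / ggml_tensor
#include <vector>

// Simplified sketch: per-layer K and V tensors with independent storage types.
struct kv_cache_split {
    ggml_type type_k = GGML_TYPE_Q8_0; // 8-bit keys   (K8)
    ggml_type type_v = GGML_TYPE_Q4_0; // 4-bit values (V4)

    std::vector<ggml_tensor *> k_l; // one K tensor per layer
    std::vector<ggml_tensor *> v_l; // one V tensor per layer
};
```
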
c. Update Attention Mechanism
When writing to or reading from the KV cache, use the specified quantization for keys and values. This involves:
- quantizing keys with kvq_key_bits on write
- quantizing values with kvq_val_bits on write
- dequantizing both on read
Example pseudocode:
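
The pseudocode block from the original post did not survive the page rendering; below is a self-contained toy sketch of the write path, using a hand-rolled symmetric quantizer purely for illustration (real integration would go through ggml's quantized tensor types instead):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Toy symmetric quantizer: map floats to `bits`-wide signed integers with a
// single per-vector scale. Illustration only.
static std::vector<int8_t> quantize_bits(const std::vector<float> & x, int bits, float & scale) {
    const float qmax = float((1 << (bits - 1)) - 1); // 127 for 8-bit, 7 for 4-bit
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    scale = amax > 0.0f ? amax / qmax : 1.0f;
    std::vector<int8_t> q(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        q[i] = (int8_t) std::lround(x[i] / scale);
    }
    return q;
}

struct quantized_kv_entry {
    std::vector<int8_t> k, v;
    float k_scale = 1.0f, v_scale = 1.0f;
};

// On write: keys get kvq_key_bits of precision, values get kvq_val_bits.
// On read, the inverse applies: x[i] is approximately q[i] * scale.
static quantized_kv_entry kv_cache_write(const std::vector<float> & k,
                                         const std::vector<float> & v,
                                         int kvq_key_bits, int kvq_val_bits) {
    quantized_kv_entry e;
    e.k = quantize_bits(k, kvq_key_bits, e.k_scale); // e.g. 8-bit keys   (K8)
    e.v = quantize_bits(v, kvq_val_bits, e.v_scale); // e.g. 4-bit values (V4)
    return e;
}
```
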
d. Pass Quantization Settings
Ensure these settings are passed from CLI to the model and cache initialization routines.
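
One plausible way to thread the settings through, given that llama_context_params in upstream llama.cpp exposes type_k and type_v; the bits_to_cache_type helper and everything else here is an illustrative assumption:

```cpp
#include "llama.h"

// Hypothetical helper: map a requested bit width to a ggml cache type.
static ggml_type bits_to_cache_type(int bits) {
    switch (bits) {
        case 8:  return GGML_TYPE_Q8_0;
        case 4:  return GGML_TYPE_Q4_0;
        default: return GGML_TYPE_F16; // fall back to the usual fp16 cache
    }
}

static llama_context_params make_ctx_params(int kvq_key_bits, int kvq_val_bits) {
    llama_context_params cparams = llama_context_default_params();
    cparams.type_k = bits_to_cache_type(kvq_key_bits); // K cache storage type
    cparams.type_v = bits_to_cache_type(kvq_val_bits); // V cache storage type
    // Note: in upstream llama.cpp a quantized V cache generally also requires
    // flash attention to be enabled.
    return cparams;
}
```
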
3. Example llamafile Patch (Unified Diff Format)
Below is a minimal patch outline (for illustration; actual implementation will require deeper integration with llamafile's codebase):
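
The diff block itself was lost when the post was rendered; the fragment below is a purely illustrative stand-in, with hypothetical file paths and elided hunk headers, showing the shape such a patch could take:

```diff
--- a/llama.cpp/common.cpp
+++ b/llama.cpp/common.cpp
@@ ... @@ (inside the argument-parsing loop)
+    if (arg == "--kvq-key") {
+        params.kvq_key_bits = std::atoi(argv[++i]); // bits for keys
+        continue;
+    }
+    if (arg == "--kvq-val") {
+        params.kvq_val_bits = std::atoi(argv[++i]); // bits for values
+        continue;
+    }
@@ ... @@ (where the llama context parameters are set up)
+    cparams.type_k = bits_to_cache_type(params.kvq_key_bits);
+    cparams.type_v = bits_to_cache_type(params.kvq_val_bits);
```
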
4. Usage Example
After building your patched llamafile, you can run:
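
The command line itself was also lost in rendering; with the flags proposed above it would presumably look something like this (the model path is a placeholder):

```sh
./llamafile -m model.gguf -p "Hello, world" --kvq-key 8 --kvq-val 4
```
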
This will use 8-bit keys and 4-bit values for the KV cache, as in KVSplit's recommended K8V4 configuration[1].
5. Reference
For a full implementation and more advanced features (e.g., Metal support, benchmarking), see the [KVSplit repository][1] and its patching scripts.
Note:
This is a high-level patch outline. For a production patch, carefully review the [KVSplit codebase][1], especially the changes to llama.cpp, and adapt them to llamafile's code structure. Test thoroughly for correctness and performance.
[1] https://github.com/dipampaul17/KVSplit
[2] ollama/ollama#7274
[3] https://www.reddit.com/r/LocalLLaMA/comments/1h62u1p/ollama_has_merged_in_kv_cache_quantisation/
[4] https://www.digitalocean.com/community/tutorials/splitting-llms-across-multiple-gpus
[5] https://techdocs.broadcom.com/us/en/vmware-tanzu/platform-services/genai-on-tanzu-platform-for-cloud-foundry/10-0/ai-cf/explanation-understanding-ollama-configuration.html
[6] https://github.com/ollama/ollama/blob/main/fs/ggml/ggml.go
[7] https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html
[8] https://collabnix.com/ollama-vs-chatgpt-2025-complete-technical-comparison-guide/
[9] https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/
[10] https://itnext.io/ai-introduction-to-ollama-for-local-llm-launch-a95e5200c3e7