KV Split for lllamafile cli options patch to make llm inference on mac silicon 3X faster #774

creativeautomaton · 2025-07-06T19:52:11Z

creativeautomaton
Jul 6, 2025

To patch a llamafile (which is a single-file distribution of llama.cpp) to use KVSplit from [KVSplit GitHub][1], you need to integrate differentiated key/value quantization for the KV cache—allowing, for example, 8-bit keys and 4-bit values. Below is a step-by-step guide and a sample patch outline.

1. Preparation

Clone KVSplit and review the changes it applies to llama.cpp, especially in the attention/KV cache handling code.
Extract the relevant changes (primarily the new quantization logic and CLI flags).
Apply these changes to your llamafile source before building.

2. Patch Outline

a. Add CLI Flags

Add support for:

--kvq-key N (bits for keys)
--kvq-val N (bits for values)

in your CLI argument parsing code.

// In main CLI argument parsing section
int kvq_key_bits = 16; // default FP16
int kvq_val_bits = 16;

if (strcmp(argv[i], "--kvq-key") == 0) {
    kvq_key_bits = atoi(argv[++i]);
} else if (strcmp(argv[i], "--kvq-val") == 0) {
    kvq_val_bits = atoi(argv[++i]);
}

b. Modify KV Cache Data Structures

Update the KV cache struct to store keys and values with separate quantization.

typedef struct {
    // ... existing fields ...
    void *key_cache;   // now quantized to kvq_key_bits
    void *value_cache; // now quantized to kvq_val_bits
    int key_bits;
    int value_bits;
} kv_cache_t;

c. Update Attention Mechanism

When writing to or reading from the KV cache, use the specified quantization for keys and values. This involves:

Quantizing keys to kvq_key_bits on write
Quantizing values to kvq_val_bits on write
Dequantizing as needed for attention computation

Example pseudocode:

void write_kv_cache(kv_cache_t *cache, float *key, float *value, int token_idx) {
    quantize(key, cache->key_cache[token_idx], cache->key_bits);
    quantize(value, cache->value_cache[token_idx], cache->value_bits);
}

d. Pass Quantization Settings

Ensure these settings are passed from CLI to the model and cache initialization routines.

3. Example llamafile Patch (Unified Diff Format)

Below is a minimal patch outline (for illustration; actual implementation will require deeper integration with llamafile's codebase):

--- a/llama.cpp/main.cpp
+++ b/llama.cpp/main.cpp
@@ ... @@
+int kvq_key_bits = 16;
+int kvq_val_bits = 16;
+
 for (int i = 1; i key_bits = kvq_key_bits;
+cache->value_bits = kvq_val_bits;
+
+// When storing keys/values:
+quantize(key, cache->key_cache[token_idx], cache->key_bits);
+quantize(value, cache->value_cache[token_idx], cache->value_bits);

4. Usage Example

After building your patched llamafile, you can run:

./llamafile -m model.gguf --kvq-key 8 --kvq-val 4

This will use 8-bit keys and 4-bit values for the KV cache, as in KVSplit's recommended K8V4 configuration[1].

5. Reference

For a full implementation and more advanced features (e.g., Metal support, benchmarking), see the [KVSplit repository][1] and its patching scripts.

Note:
This is a high-level patch outline. For a production patch, carefully review the [KVSplit codebase][1], especially the changes to llama.cpp, and adapt them to llamafile's code structure. Test thoroughly for correctness and performance.

[1]

[1] https://github.com/dipampaul17/KVSplit
[2] ollama/ollama#7274
[3] https://www.reddit.com/r/LocalLLaMA/comments/[1](https://github.com/dipampaul17/KVSplit)h62u1p/ollama_has_merged_in_kv_cache_quantisation/
[4] https://www.digitalocean.com/community/tutorials/splitting-llms-across-multiple-gpus
[5] https://techdocs.broadcom.com/us/en/vmware-tanzu/platform-services/genai-on-tanzu-platform-for-cloud-foundry/10-0/ai-cf/explanation-understanding-ollama-configuration.html
[6] https://github.com/ollama/ollama/blob/main/fs/ggml/ggml.go
[7] https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html
[8] https://collabnix.com/ollama-vs-chatgpt-2025-complete-technical-comparison-guide/
[9] https://smcleod.net/2024/12/bringing-k/v-context-quantisation-to-ollama/
[10] https://itnext.io/ai-introduction-to-ollama-for-local-llm-launch-a95e5200c3e7

cjpais · 2025-07-07T14:27:48Z

cjpais
Jul 7, 2025
Collaborator

I'm open to pulling this in if there's a PR and it is well tested

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

KV Split for lllamafile cli options patch to make llm inference on mac silicon 3X faster #774

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Select a reply

Uh oh!

KV Split for lllamafile cli options patch to make llm inference on mac silicon 3X faster #774

Uh oh!

creativeautomaton Jul 6, 2025

1. Preparation

2. Patch Outline

a. Add CLI Flags

b. Modify KV Cache Data Structures

c. Update Attention Mechanism

d. Pass Quantization Settings

3. Example llamafile Patch (Unified Diff Format)

4. Usage Example

5. Reference

Replies: 1 comment

Uh oh!

Uh oh!

cjpais Jul 7, 2025 Collaborator

creativeautomaton
Jul 6, 2025

cjpais
Jul 7, 2025
Collaborator