Can I use CUSTOM buffer type to optimize the KVCache? #13670

Zijie-Tian · 2025-05-20T20:01:29Z

Recently, I noticed that I might be able to get optimizations such as ARM SIMD kernels through a custom buffer type. After reading the KV cache source code (llama-kv-cache.cpp), it appears that the current KV cache has no support for extra_buffer_type.

llama.cpp/src/llama-kv-cache.cpp

Lines 72 to 110 in b7a1746

    
    for (uint32_t il = 0; il < hparams.n_layer; il++) {
        if (filter && !filter(il)) {
            LLAMA_LOG_DEBUG("%s: layer %3d: skipped\n", __func__, il);
            continue;
        }

        const uint32_t n_embd_k_gqa = hparams.n_embd_k_gqa(il) + hparams.n_embd_k_s();
        const uint32_t n_embd_v_gqa = hparams.n_embd_v_gqa(il) + hparams.n_embd_v_s();

        const char * dev_name = "CPU";

        ggml_backend_buffer_type_t buft = ggml_backend_cpu_buffer_type();

        if (offload) {
            auto * dev = model.dev_layer(il);
            buft = ggml_backend_dev_buffer_type(dev);

            dev_name = ggml_backend_dev_name(dev);
        }

        LLAMA_LOG_DEBUG("%s: layer %3d: dev = %s\n", __func__, il, dev_name);

        ggml_context * ctx = ctx_for_buft(buft);
        if (!ctx) {
            throw std::runtime_error("failed to create ggml context for kv cache");
        }

        ggml_tensor * k;
        ggml_tensor * v;

        k = ggml_new_tensor_2d(ctx, type_k, n_embd_k_gqa, kv_size);
        v = ggml_new_tensor_2d(ctx, type_v, n_embd_v_gqa, kv_size);

        ggml_format_name(k, "cache_k_l%d", il);
        ggml_format_name(v, "cache_v_l%d", il);

        map_layer_ids[il] = layers.size();
        layers.push_back({ il, k, v });
    }

I'm unsure whether changing the buffer type here would be safe in the current version of llama.cpp, and whether it would enable operator optimizations that rely on special memory layouts.

Additionally, if I want to optimize the KV cache with a specific buffer type, would I need to do something similar to what ggml-cpu-aarch64.cpp does, but change this section to handle the GGML_OP_FLASH_ATTN_EXT operator instead?

llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp

Lines 6368 to 6414 in b7a1746

    
    class extra_buffer_type : ggml::cpu::extra_buffer_type {
        bool supports_op(ggml_backend_dev_t, const struct ggml_tensor * op) override {
            if (    op->op == GGML_OP_MUL_MAT &&
                    op->src[0]->buffer &&
                    (ggml_n_dims(op->src[0]) == 2) &&
                    op->src[0]->buffer->buft == ggml_backend_cpu_aarch64_buffer_type() &&
                    ggml_aarch64_get_optimal_repack_type(op->src[0])
                    ) {
                if (op->src[1]->buffer && !ggml_backend_buft_is_host(op->src[1]->buffer->buft)) {
                    return false;
                }
                if (op->src[1]->type == GGML_TYPE_F32) {
                    return true;
                }
                //if (op->src[1]->type == GGML_TYPE_Q8_0) {
                //    return true;
                //}
                // may be possible if Q8_0 packed...
            } else if (op->op == GGML_OP_MUL_MAT_ID
                    && op->src[0]->buffer
                    && (ggml_n_dims(op->src[0]) == 3)
                    && op->src[0]->buffer->buft == ggml_backend_cpu_aarch64_buffer_type()
                    && ggml_aarch64_get_optimal_repack_type(op->src[0])
                    ) {
                if (op->src[1]->buffer && !ggml_backend_buft_is_host(op->src[1]->buffer->buft)) {
                    return false;
                }
                if (op->src[1]->type == GGML_TYPE_F32) {
                    return true;
                }
                //if (op->src[1]->type == GGML_TYPE_Q8_0) {
                //    return true;
                //}
            }
            return false;
        }

        ggml::cpu::tensor_traits * get_tensor_traits(const struct ggml_tensor * op) override {
            if (op->op == GGML_OP_MUL_MAT || op->op == GGML_OP_MUL_MAT_ID) {
                if (op->src[0]->buffer && op->src[0]->buffer->buft == ggml_backend_cpu_aarch64_buffer_type()) {
                    return (ggml::cpu::tensor_traits *) op->src[0]->extra;
                }
            }
            return nullptr;
        }
    };
    }  // namespace ggml::cpu::aarch64
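
For illustration only, here is a rough sketch of what the analogous check for GGML_OP_FLASH_ATTN_EXT might look like. It uses simplified stand-in types (`mock_tensor`, `mock_buffer`, `mock_buffer_type`) rather than the real ggml structures so that it compiles on its own, and `kv_repack_buffer_type()` and `kv_supports_op()` are hypothetical names that do not exist in ggml. It assumes the usual flash-attention operand order (src[0] = Q, src[1] = K, src[2] = V) and that Q arrives as F32 in host memory.

```cpp
#include <cassert>

// Stand-in types: simplified mocks, NOT the real ggml definitions.
enum mock_op   { MOCK_OP_NONE, MOCK_OP_MUL_MAT, MOCK_OP_FLASH_ATTN_EXT };
enum mock_type { MOCK_TYPE_F32, MOCK_TYPE_F16 };

struct mock_buffer_type { const char * name; bool is_host; };
struct mock_buffer      { const mock_buffer_type * buft; };

struct mock_tensor {
    mock_op             op;
    mock_type           type;
    const mock_buffer * buffer;
    const mock_tensor * src[4];
};

// Hypothetical buffer type that would hold a repacked KV cache.
inline const mock_buffer_type * kv_repack_buffer_type() {
    static const mock_buffer_type buft = { "KV_REPACK", /*is_host=*/true };
    return &buft;
}

// True if the tensor is allocated in the hypothetical KV buffer type.
static bool in_kv_repack_buffer(const mock_tensor * t) {
    return t && t->buffer && t->buffer->buft == kv_repack_buffer_type();
}

// Same shape as the supports_op quoted above, but keyed on flash attention:
// the op is claimed only when both K and V live in the custom buffer type.
bool kv_supports_op(const mock_tensor * op) {
    if (op->op != MOCK_OP_FLASH_ATTN_EXT) {
        return false;
    }
    const mock_tensor * q = op->src[0];
    const mock_tensor * k = op->src[1];
    const mock_tensor * v = op->src[2];
    if (!q || !in_kv_repack_buffer(k) || !in_kv_repack_buffer(v)) {
        return false;
    }
    // Q must be readable from the CPU, mirroring the host check on src[1] above.
    if (q->buffer && !q->buffer->buft->is_host) {
        return false;
    }
    return q->type == MOCK_TYPE_F32;
}
```

Note that this only covers the attention op itself; every other op that touches the cache tensors would need the same treatment.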

slaren · 2025-05-21T13:53:27Z

Maintainer

No, the KV cache requires some operations that aren't implemented in the extra buffer types. I expect that at least there would be problems with ggml_cpy operations and with views.
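
To make the concern about views concrete, the following self-contained sketch shows how a view that assumes a plain row-major layout reads the wrong values once the underlying buffer has been repacked. The interleaved layout here (`offset_interleaved`, `repack`) is a made-up example for illustration, not the layout used by any ggml buffer type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major cache: element (row, col) lives at row * n_embd + col.
size_t offset_row_major(size_t row, size_t col, size_t n_embd) {
    return row * n_embd + col;
}

// Hypothetical repacked layout: rows are grouped into blocks of `block_rows`,
// and inside a block the rows are interleaved column by column.
size_t offset_interleaved(size_t row, size_t col, size_t n_embd, size_t block_rows) {
    const size_t block = row / block_rows;
    const size_t lane  = row % block_rows;
    return block * block_rows * n_embd + col * block_rows + lane;
}

// Store a logical [n_rows x n_embd] matrix using the interleaved layout.
std::vector<float> repack(const std::vector<float> & src,
                          size_t n_rows, size_t n_embd, size_t block_rows) {
    assert(n_rows % block_rows == 0);
    assert(src.size() == n_rows * n_embd);
    std::vector<float> dst(src.size());
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t c = 0; c < n_embd; ++c) {
            dst[offset_interleaved(r, c, n_embd, block_rows)] =
                src[offset_row_major(r, c, n_embd)];
        }
    }
    return dst;
}

// What a plain view effectively does: treat the buffer as row-major and
// read n_embd contiguous values starting at row * n_embd.
std::vector<float> view_row_as_row_major(const std::vector<float> & buf,
                                         size_t row, size_t n_embd) {
    return std::vector<float>(buf.begin() + row * n_embd,
                              buf.begin() + (row + 1) * n_embd);
}
```

Applying `view_row_as_row_major` to the repacked buffer returns a mix of elements from several logical rows, which is the kind of breakage a copy into a view of the cache would hit unless the op is made layout-aware.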

Zijie-Tian · 2025-05-21T13:58:55Z

Author

So, if I wanted to implement an extra_buffer_type for the KV cache, how should I go about it? Is it even feasible?

slaren May 21, 2025
Maintainer

It is feasible, but it may require significant changes that may be hard to make without previous knowledge of the ggml code. Mainly, you would need to implement the missing operations, and ensure that they are properly routed to the extra buffer type compute functions.
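
As a very rough picture of the routing described here, the sketch below models an extra buffer type that hands out tensor traits and a dispatcher that decides which kernel runs. All types and names (`demo_tensor`, `demo_traits`, `demo_extra_buft`, `dispatch`) are simplified stand-ins invented for this example; the real ggml interfaces differ in detail, and the `unsupported` outcome is an assumption about what one would want to detect rather than a description of ggml's actual behavior.

```cpp
#include <cassert>
#include <vector>

// Stand-in op list; a KV cache needs more than matrix multiplication.
enum demo_op { DEMO_OP_MUL_MAT, DEMO_OP_CPY, DEMO_OP_VIEW, DEMO_OP_FLASH_ATTN_EXT };

struct demo_tensor {
    demo_op op;
    bool    in_custom_buffer;  // true if an operand lives in the custom buffer type
};

// Stand-in for a per-tensor traits object with a compute hook.
struct demo_traits {
    virtual ~demo_traits() = default;
    // Returns true if this traits object handled the op itself.
    virtual bool compute_forward(const demo_tensor & op) = 0;
};

struct demo_kv_traits : demo_traits {
    std::vector<demo_op> implemented;  // ops with a layout-aware kernel
    bool compute_forward(const demo_tensor & op) override {
        for (demo_op o : implemented) {
            if (o == op.op) {
                return true;  // a real layout-aware kernel would run here
            }
        }
        return false;
    }
};

// Stand-in for an extra buffer type that owns the traits.
struct demo_extra_buft {
    demo_kv_traits traits;
    demo_traits * get_tensor_traits(const demo_tensor & op) {
        return op.in_custom_buffer ? &traits : nullptr;
    }
};

enum class route { custom_kernel, default_kernel, unsupported };

// Decide which kernel an op would be routed to.
route dispatch(demo_extra_buft & buft, const demo_tensor & op) {
    demo_traits * t = buft.get_tensor_traits(op);
    if (t == nullptr) {
        return route::default_kernel;  // data is in a regular buffer
    }
    if (t->compute_forward(op)) {
        return route::custom_kernel;
    }
    // The default kernel would misinterpret the custom layout.
    return route::unsupported;
}
```

With only flash attention listed in `implemented`, a copy into the cache ends up as `unsupported`, which matches the point that the missing operations have to be written before the cache can live in such a buffer.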

Answer selected by Zijie-Tian

Zijie-Tian May 21, 2025
Author

Thank you @slaren !
