Can I use CUSTOM buffer type to optimize the KVCache? #13670
-
Recently, I noticed that I might be able to apply optimizations such as ARM SIMD by changing the buffer type used in llama.cpp/src/llama-kv-cache.cpp (lines 72 to 110 in b7a1746). Would it be safe to modify the buffer type there directly in the current version of llama.cpp, and would doing so allow operator optimizations that rely on special data layouts? Additionally, if I need to optimize the KV cache with a specific buffer type, would I need to do something similar to what is done in llama.cpp/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp (lines 6368 to 6414 in b7a1746)?
-
No. The KV cache requires some operations that are not implemented in the extra buffer types. I would expect, at the very least, problems with …
-
So I'd like to ask: if I need to implement an …
It is feasible, but it may require significant changes that would be hard to make without prior knowledge of the ggml code. Mainly, you would need to implement the missing operations and ensure that they are properly routed to the extra buffer type's compute functions.
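To illustrate the routing requirement described in this reply, here is a toy C++ model. It is not the ggml API: `ExtraBufferType`, `Op`, `route`, and `missing_ops` are hypothetical names, and the op names (`CPY`, `MUL_MAT`, `SOFT_MAX`) are illustrative placeholders rather than the exact set of operations llama.cpp uses on the KV cache. The assumption it encodes is that a tensor stored in an extra buffer type has a custom layout, so any op that reads or writes it must have a kernel in that buffer type; the default kernels cannot simply take over.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model (hypothetical names, NOT the ggml API) of how ops touching
// tensors stored in an "extra" buffer type must be routed to its kernels.
enum class Impl { Default, Extra, Unsupported };

struct ExtraBufferType {
    // Ops that have a kernel understanding this buffer type's special layout.
    std::set<std::string> supported_ops;

    bool supports_op(const std::string & op) const {
        return supported_ops.count(op) != 0;
    }
};

struct Op {
    std::string name;
    bool touches_extra_buffer;  // does any operand live in the extra buffer?
};

// Decide which implementation runs an op. Under this model's assumption the
// default kernels cannot read the custom layout, so an op on an extra-buffer
// tensor without a matching extra kernel cannot run at all.
Impl route(const ExtraBufferType & extra, const Op & op) {
    if (!op.touches_extra_buffer) {
        return Impl::Default;
    }
    return extra.supports_op(op.name) ? Impl::Extra : Impl::Unsupported;
}

// List the ops that would still need an implementation before a tensor
// (for example, the KV cache) could be placed in the extra buffer type.
std::vector<std::string> missing_ops(const ExtraBufferType & extra,
                                     const std::vector<Op> & graph) {
    std::vector<std::string> missing;
    for (const Op & op : graph) {
        if (route(extra, op) == Impl::Unsupported) {
            missing.push_back(op.name);
        }
    }
    return missing;
}
```

For example, with an extra buffer type that only supports `MUL_MAT`, a graph containing `{"CPY", true}`, `{"MUL_MAT", true}`, and `{"SOFT_MAX", false}` would route `MUL_MAT` to the extra kernel, `SOFT_MAX` to the default kernel, and report `CPY` as missing. In the real code, filling that gap means writing the kernel and making sure the buffer type reports support for the op so the scheduler dispatches to it.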