How to make each KV Cache token its own tensor and concat together when needed #7601

mjkpolo · 2024-05-28T22:33:47Z

Hi,

I'm trying to modify the KV cache so that each token is stored in its own tensor. To do this I added a second k_l and v_l, each of which is a vector of double pointers (ggml_tensor **), so that each token could be stored on a different device. When the entire cache is needed, I concatenate the per-token tensors by copying them into the original k_l and v_l, which act as copy buffers.
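To make the intended data flow concrete, here is a minimal sketch of the "one buffer per token, gather on demand" idea in plain C++. It does not use ggml at all: `TokenSlot`, `gather_tokens`, and the `staging` vector are hypothetical names standing in for the per-token tensors and the original k_l / v_l copy buffer, and everything is assumed to live in host memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one token's K (or V) row in one layer.
// In the real code this would be a per-token tensor on some device.
struct TokenSlot {
    std::vector<float> data;  // n_embd values for this token
};

// Copy the first n_tokens slots into `staging`, one row per token,
// so that `staging` looks like a single contiguous cache.
// Returns the number of floats written.
inline std::size_t gather_tokens(const std::vector<TokenSlot>& slots,
                                 std::size_t n_tokens,
                                 std::size_t n_embd,
                                 std::vector<float>& staging) {
    assert(n_tokens <= slots.size());
    assert(staging.size() >= n_tokens * n_embd);
    for (std::size_t t = 0; t < n_tokens; ++t) {
        assert(slots[t].data.size() == n_embd);
        // Row t of the staging buffer starts at float offset t * n_embd.
        std::memcpy(staging.data() + t * n_embd,
                    slots[t].data.data(),
                    n_embd * sizeof(float));
    }
    return n_tokens * n_embd;
}
```

The staging buffer has to be sized for the largest number of tokens that will ever be gathered at once, and every row offset is computed from the token index rather than tracked incrementally, which keeps the copy loop easy to check.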

My main confusion is why I keep running out of memory in the memory pool, and how I can know ahead of time how much memory I need. So far I have just increased the pool size arbitrarily until the error message went away.
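One way to reason about the pool size ahead of time is to count tensor objects instead of guessing. The sketch below is a rough back-of-the-envelope estimate, not ggml's actual accounting: `estimate_pool` and `PoolEstimate` are hypothetical names, and `per_tensor_overhead` is an input you would have to supply (ggml exposes `ggml_tensor_overhead()` for the per-tensor metadata size). It assumes one K and one V tensor per token per layer, and it does not count the extra view and copy nodes created while building the graph, which also consume pool memory.

```cpp
#include <cstddef>

// Round n up to a multiple of align (align must be > 0).
inline std::size_t pad_to(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

struct PoolEstimate {
    std::size_t n_tensors;  // tensor objects created in the context
    std::size_t bytes;      // estimated pool size for those tensors
};

// n_layer, n_tokens:     model layers and cache length in tokens
// row_bytes:             bytes of K (or V) data for one token in one layer
// per_tensor_overhead:   metadata bytes per tensor object (assumed input)
// align:                 data alignment used by the allocator (assumed input)
// alloc_data:            true if tensor data is also taken from this pool
inline PoolEstimate estimate_pool(std::size_t n_layer,
                                  std::size_t n_tokens,
                                  std::size_t row_bytes,
                                  std::size_t per_tensor_overhead,
                                  std::size_t align,
                                  bool alloc_data) {
    // One K tensor and one V tensor per token per layer.
    const std::size_t n_tensors = 2 * n_layer * n_tokens;
    const std::size_t per_tensor =
        per_tensor_overhead + (alloc_data ? pad_to(row_bytes, align) : 0);
    return {n_tensors, n_tensors * per_tensor};
}
```

The useful takeaway is the scaling: splitting the cache per token multiplies the number of tensor objects by the cache length, so a pool that was sized for two tensors per layer will be too small by roughly that factor even when no tensor data is stored in it.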

I'm also getting an EXC_BAD_ACCESS segfault in llm_build_kv according to lldb, and I'm not sure what is causing it. Is it because I'm making the context too large? I doubt it, because shouldn't that print an error? It's more likely a problem with how I take the view and copy into it.
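If the view-and-copy step is the suspect, one debugging approach is to check every offset and length against the destination size before writing, so a bad range fails loudly instead of crashing later. The following is a generic sketch in plain C++ with hypothetical helper names (`range_ok`, `checked_copy`); it illustrates the check itself and is not how ggml validates views.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// True if the byte range [offset, offset + len) lies inside a buffer of
// `size` bytes. Written to avoid overflow in offset + len.
inline bool range_ok(std::size_t size, std::size_t offset, std::size_t len) {
    return offset <= size && len <= size - offset;
}

// Copy `len` bytes from `src` into `dst` at byte `offset`.
// Returns false, without writing anything, if the source pointer is null
// or the destination range does not fit.
inline bool checked_copy(std::vector<unsigned char>& dst,
                         std::size_t offset,
                         const void* src,
                         std::size_t len) {
    if (src == nullptr || !range_ok(dst.size(), offset, len)) {
        return false;
    }
    std::memcpy(dst.data() + offset, src, len);
    return true;
}
```

Two things this kind of check tends to catch are a per-token pointer that was never initialized (null or stale) and a row offset computed in elements where bytes were expected, or the reverse.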

The code is disgusting because I'm just trying this idea out. I'd appreciate any guidance. Here are the rough draft changes I've made if you'd like to take a peek.

Thanks!! Any comments are appreciated.
