Replies: 1 comment
-
You can achieve that (a sliding window over the context) with context-shift.
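In the main example that shift is applied directly to the KV cache: the oldest tokens are removed and the remaining entries are moved back. A minimal sketch, assuming the `llama_kv_cache_seq_rm` / `llama_kv_cache_seq_add` names (these have been renamed in some llama.cpp versions, so check your llama.h):

```cpp
// Sketch of a context shift: drop n_discard tokens after the first n_keep
// tokens of sequence 0, then slide the remaining entries back so the next
// token can be decoded at position n_past - n_discard. Mirrors the pattern
// used by llama.cpp's main example; function names vary between versions.
#include "llama.h"

static void shift_context(llama_context * ctx, int & n_past, int n_keep, int n_discard) {
    // remove the oldest n_discard tokens that follow the kept prefix
    llama_kv_cache_seq_rm (ctx, 0, n_keep,             n_keep + n_discard);
    // move everything after the removed range back by n_discard positions
    llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard; // decode the next token at this (shifted) position
}
```

With `n_keep = 0` and `n_discard = 1` the window slides by exactly one token; the main example discards a larger chunk at once (half of the non-kept tokens) so the shift doesn't have to run on every single new token.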
-
I'm trying to achieve a behavior where I add one token to the context, compute the logits for the next token, and then repeat the process. I don't fully understand how llama.cpp manages memory. In the simple example, a new batch is created with just one token, so I assumed that it caches previous tokens (perhaps using a kv cache or something similar). However, as I'm new to neural networks, my understanding might be off. After a brief look, I thought it might be using a ring buffer, allowing me to simply add batches, but I quickly realized that isn't the case.
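Concretely, the loop I have in mind looks roughly like this (a simplified sketch following the "simple" example; `ctx`, `smpl`, `first_token` and `n_predict` stand in for my real setup, and the exact signatures of `llama_batch_get_one` / `llama_sampler_sample` depend on the llama.cpp version):

```cpp
#include "llama.h"

// Decode one token at a time, as in the "simple" example: each llama_decode
// call adds a single token to the context (earlier tokens stay in the KV
// cache), then the next token is sampled from the returned logits.
static void generate(llama_context * ctx, llama_sampler * smpl, llama_token first_token, int n_predict) {
    llama_token tok = first_token;
    for (int i = 0; i < n_predict; ++i) {
        llama_batch batch = llama_batch_get_one(&tok, 1); // batch holding just this one token
        if (llama_decode(ctx, batch) != 0) {
            break; // decoding fails once the context window is exhausted
        }
        tok = llama_sampler_sample(smpl, ctx, -1); // sample from the logits of the last token
    }
}
```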
My goal is to implement a sliding-window context. For instance, if n_ctx is 5 and the current context is "12345", adding "6" should result in "23456". Here is the code I wrote, based on the "simple" example:
I'm using the model llama-3.2-3b-q8_0.gguf with the CUDA backend, in case that makes any difference.

How can I achieve the sliding-window behavior in llama.cpp? Is there an internal mechanism that supports this, or do I need to manually manage the context and discard the oldest token when adding a new one? I would truly appreciate any insights or guidance you can share.