Replies: 1 comment
-
You can achieve that (a sliding window over the context) with context-shift.
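In the main example that shift is applied directly to the KV cache: the oldest tokens are removed and the remaining entries are moved back. A minimal sketch, assuming the `llama_kv_cache_seq_rm` / `llama_kv_cache_seq_add` names (these have been renamed in some llama.cpp versions, so check your llama.h):

```cpp
// Sketch of a context shift: drop n_discard tokens after the first n_keep
// tokens of sequence 0, then slide the remaining entries back so the next
// token can be decoded at position n_past - n_discard. Mirrors the pattern
// used by llama.cpp's main example; function names vary between versions.
#include "llama.h"

static void shift_context(llama_context * ctx, int & n_past, int n_keep, int n_discard) {
    // remove the oldest n_discard tokens that follow the kept prefix
    llama_kv_cache_seq_rm (ctx, 0, n_keep,             n_keep + n_discard);
    // move everything after the removed range back by n_discard positions
    llama_kv_cache_seq_add(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard; // decode the next token at this (shifted) position
}
```

With `n_keep = 0` and `n_discard = 1` the window slides by exactly one token; the main example discards a larger chunk at once (half of the non-kept tokens) so the shift doesn't have to run on every single new token.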
-
I'm trying to achieve a behavior where I add one token to the context, compute the logits for the next token, and then repeat the process. I don't fully understand how llama.cpp manages memory. In the simple example, a new batch is created with just one token, so I assumed that it caches previous tokens (perhaps using a kv cache or something similar). However, as I'm new to neural networks, my understanding might be off. After a brief look, I thought it might be using a ring buffer, allowing me to simply add batches, but I quickly realized that isn't the case.
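Concretely, the loop I have in mind looks roughly like this (a simplified sketch following the "simple" example; `ctx`, `smpl`, `first_token` and `n_predict` stand in for my real setup, and the exact signatures of `llama_batch_get_one` / `llama_sampler_sample` depend on the llama.cpp version):

```cpp
#include "llama.h"

// Decode one token at a time, as in the "simple" example: each llama_decode
// call adds a single token to the context (earlier tokens stay in the KV
// cache), then the next token is sampled from the returned logits.
static void generate(llama_context * ctx, llama_sampler * smpl, llama_token first_token, int n_predict) {
    llama_token tok = first_token;
    for (int i = 0; i < n_predict; ++i) {
        llama_batch batch = llama_batch_get_one(&tok, 1); // batch holding just this one token
        if (llama_decode(ctx, batch) != 0) {
            break; // decoding fails once the context window is exhausted
        }
        tok = llama_sampler_sample(smpl, ctx, -1); // sample from the logits of the last token
    }
}
```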
My goal is to implement a sliding-window context. For instance, if n_ctx is 5 and the current context is "12345", adding "6" should result in "23456". Here is the code I wrote, based on the "simple" example:
I'm using the model llama-3.2-3b-q8_0.gguf with the CUDA backend, in case that makes any difference.

How can I achieve the sliding-window behavior in llama.cpp? Is there an internal mechanism that supports this, or do I need to manually manage the context and discard the oldest token when adding a new one? I would truly appreciate any insights or guidance you can share.