
Regression in unified KV cache appears after llama.cpp release b5912 in b5913 #2045

@akarasulu

Description


This issue concerns the llama-cpp-python community but was filed on the llama.cpp tracker first: ggml-org/llama.cpp#14847.

I wanted to bring it to your attention, and I can move the issue here if this is the more appropriate tracker. For convenience, the original issue description is reproduced below:

Running llama-cpp-python against llama.cpp built after release b5912 (starting with b5913) fails with:

llama.cpp/src/llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed

This appears to be a regression in sequence-ID handling or in the unified KV cache logic that affects external bindings. It is consistent with the extensive KV cache rework in b5913 that prepares the K/V buffers for separation.

NOTE: llama-cli runs successfully, but running llama-cpp-python against llama.cpp with the same model fails.
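For reference, a minimal reproduction sketch of the kind of llama-cpp-python usage that hits the assert against an affected llama.cpp build. The model path and parameters below are placeholders, not the exact configuration from the original report:

```python
# Minimal reproduction sketch (hypothetical model path and parameters).
# Against llama.cpp builds after b5912, the process aborts during decoding with:
#   llama-kv-cache-unified.cpp:222: GGML_ASSERT(seq_id >= 0 && (size_t) seq_id < seq_to_stream.size()) failed
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # placeholder path to any GGUF model
    n_ctx=2048,
)

# Any simple completion is enough to exercise the unified KV cache path.
out = llm("Hello, world", max_tokens=16)
print(out["choices"][0]["text"])
```

The same model loaded through llama-cli completes without error, which is what points at the bindings-facing sequence-ID path rather than the model itself.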
