
Llama.embed crashes when n_batch > 512 #1762

Open
@lsorber

Description


Expected Behavior

Embedding text with a long-context model like BGE-M3 [1] should produce token embeddings for sequences longer than 512 tokens. This is of interest for 'late interaction' retrieval [2] (see the sketch below the references).

Instead, llama-cpp-python truncates the input to the first n_batch tokens, where n_batch defaults to 512. The expected behaviour is that setting n_batch to a larger value allows computing token embeddings for longer sequences.

[1] https://huggingface.co/BAAI/bge-m3
[2] https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/
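
For background, late-interaction retrieval scores a query against a document by comparing their per-token embeddings directly, which is why token embeddings for the full (long) document are needed. A minimal NumPy sketch of ColBERT-style MaxSim scoring (illustrative only; the name maxsim_score is hypothetical, not part of llama-cpp-python):

import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # MaxSim: for each query token embedding, take its best dot-product
    # match among the document token embeddings, then sum the maxima.
    sim = query_emb @ doc_emb.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())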

Current Behavior

The kernel crashes when embedding text with any n_batch > 512. The crash is not specific to this embedding model; it occurs with the few models I've tried.

Steps to Reproduce

On a Google Colab T4 instance:

%pip install --quiet --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122 llama-cpp-python==0.3.0

from llama_cpp import LLAMA_POOLING_TYPE_NONE, Llama

embedder = Llama.from_pretrained(
    repo_id="lm-kit/bge-m3-gguf",
    filename="*F16.gguf",
    n_ctx=0,  # Model context is 8192
    n_gpu_layers=-1,
    n_batch=513,  # ← Any value larger than 512 (the default) causes a crash
    embedding=True,
    pooling_type=LLAMA_POOLING_TYPE_NONE,
    verbose=False
)

text = "Hello world" * 1000
embedding = embedder.embed(text)  # ← Crash 💥
len(embedding)
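
Until this is fixed, a possible workaround is to embed the text in chunks that fit within the default n_batch. A minimal sketch (untested; note that tokens in different chunks cannot attend to each other, so this only approximates a single long forward pass):

# Hypothetical workaround sketch: embed at most 512 tokens per call so
# every batch stays within the default n_batch.
tokens = embedder.tokenize(text.encode("utf-8"), add_bos=False)
chunk_size = 512
token_embeddings = []
for i in range(0, len(tokens), chunk_size):
    chunk = embedder.detokenize(tokens[i : i + chunk_size]).decode("utf-8")
    token_embeddings.extend(embedder.embed(chunk))  # one batch per chunk
len(token_embeddings)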
