Name and Version
/opt/homebrew/bin/llama-server --version
version: 5920 (d9b6910)
built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
llama-server
Command line
/opt/homebrew/bin/llama-server -m Qwen3-Embedding-8B-Q4_K_M.gguf --alias Qwen3-embedding --embedding --pooling last -ub 8192 --verbose-prompt --offline -c 40960 --no-mmap --mlock --port 9008
Problem description & steps to reproduce
The server generates working embeddings for a while (hours, sometimes days, at roughly one embedding request per minute), but eventually the embedding vectors start coming back with only null elements. I see no errors or other indicators in the log output when this happens, and I have to restart the server to recover.
When the server is in the error state (I omitted the repetitive middle of the vector in the response):
% curl -X POST http://localhost:9008/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "test"}'
{"model":"gpt-3.5-turbo","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[null, ... ,null],"index":0,"object":"embedding"}]}
Repeating the same query after restarting the server process:
% curl -X POST http://localhost:9008/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "test"}'
{"model":"gpt-3.5-turbo","object":"list","usage":{"prompt_tokens":2,"total_tokens":2},"data":[{"embedding":[0.027558811008930206, ... ,0.021016428247094154],"index":0,"object":"embedding"}]}
I am currently unsure how to reproduce or minimize this, or how to construct a usable test case.
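Since the failure only appears after hours or days of normal operation, one way to narrow it down might be a long-running probe that mirrors the reported usage pattern (one request per minute against the endpoint above) and records exactly when the first all-null response appears. The sketch below uses only the Python standard library; the URL, port, and request cadence come from this report, while the function names and everything else are assumptions for illustration:

```python
import json
import time
import urllib.request

# Endpoint from the report (llama-server started with --port 9008).
EMBEDDINGS_URL = "http://localhost:9008/v1/embeddings"


def has_null_embedding(response_json):
    """Return True if any embedding vector in an OpenAI-style
    /v1/embeddings response body contains a null (None) element."""
    for item in response_json.get("data", []):
        vec = item.get("embedding") or []
        if any(v is None for v in vec):
            return True
    return False


def probe_once(text="test"):
    """POST a single embedding request, matching the curl call above."""
    req = urllib.request.Request(
        EMBEDDINGS_URL,
        data=json.dumps({"input": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Query once per minute, as in the reported workload, and stop
    # at the first response whose embedding contains nulls.
    while True:
        body = probe_once()
        if has_null_embedding(body):
            print("null embedding detected at", time.strftime("%F %T"))
            break
        time.sleep(60)
```

Logging the timestamp of the first bad response (and ideally the server's `--verbose-prompt` output around that moment) could help correlate the failure with uptime or request count.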
First Bad Commit
No response