Description
Is your feature request related to a problem?
Currently, the embeddings generated by llama.cpp via local-ai are not normalized. For many applications, especially those involving semantic search or vector-similarity calculations with cosine similarity, embeddings must be L2-normalized. This forces developers to perform a normalization step on the client side after receiving the embedding vector from the API.
Describe the solution you'd like
I propose adding a new boolean option to the embeddings model YAML config file, named embd_normalize (equivalent to the llama.cpp argument --embd-normalize), that triggers the normalization. I also think this behavior should be enabled by default for requests to the OpenAI-compatible endpoint /v1/embeddings, but not for /embeddings. This matches the OpenAI models (and endpoint), which return L2-normalized embedding vectors.
When this option is set to true, the llama.cpp server will perform an L2 normalization on the final embedding vector before it is returned in the API response (this is already implemented in recent llama.cpp versions). When the option is false or not present, the server should return the raw, non-normalized embedding on the /embeddings endpoint, but still return the normalized one on /v1/embeddings.
This would allow users to receive ready-to-use, normalized embeddings directly from the API, simplifying client-side logic and improving overall efficiency.
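To make the proposed behavior concrete, here is a minimal Python sketch of the dispatch logic; the function names are hypothetical for illustration and do not correspond to LocalAI's actual code:

    import math

    def l2_normalize(vec):
        # Divide each component by the vector's Euclidean (L2) norm.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm > 0 else list(vec)

    def finalize_embedding(vec, endpoint, embd_normalize=None):
        # Hypothetical dispatch: /v1/embeddings normalizes by default to
        # match OpenAI; /embeddings returns the raw vector unless the
        # model config sets embd_normalize: true.
        if embd_normalize is None:
            embd_normalize = (endpoint == "/v1/embeddings")
        return l2_normalize(vec) if embd_normalize else list(vec)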
Example model config file:
name: qwen3-embedding-4b
embeddings: true
backend: llama-cpp
context_size: 32768
f16: true
mmap: true
parameters:
  model: Qwen3-Embedding-4B-Q8_0.gguf
embd_normalize: true
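With such a config, a client could verify the behavior roughly like this (a sketch that assumes a local-ai instance listening on localhost:8080 and the model name from the config above):

    import math

    import requests

    resp = requests.post(
        "http://localhost:8080/v1/embeddings",
        json={"model": "qwen3-embedding-4b", "input": "hello world"},
    )
    vec = resp.json()["data"][0]["embedding"]
    # The L2 norm should be ~1.0 once normalization is applied on /v1/embeddings.
    print(math.sqrt(sum(x * x for x in vec)))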
Describe alternatives you've considered
The only alternative at present is to normalize the embedding vectors manually on the client side. This means receiving the raw vector from llama.cpp and then implementing a function that computes the L2 norm and divides each component of the vector by it, as in the sketch below. While functional, this approach is less efficient and requires every client application developer to reimplement the same logic.
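For reference, a minimal sketch of that client-side workaround, assuming a Python client with numpy available:

    import numpy as np

    def normalize_client_side(embedding):
        # Workaround: L2-normalize the raw embedding returned by the API
        # so it can be used directly with cosine/dot-product similarity.
        vec = np.asarray(embedding, dtype=np.float32)
        norm = np.linalg.norm(vec)
        return (vec / norm).tolist() if norm > 0 else list(embedding)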
Additional context
L2 normalization is a standard procedure for preparing embeddings for many machine learning tasks.
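Concretely, L2 normalization rescales a vector to unit length,

$$\hat{v} = \frac{v}{\lVert v \rVert_2}, \qquad \lVert v \rVert_2 = \sqrt{\sum_i v_i^2},$$

so that cosine similarity between normalized vectors reduces to a plain dot product.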