Replies: 3 comments
-
Use
-
@swordow positional embedding shift?
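A rough illustration of what a positional shift can do (my own toy sketch with made-up values, not the model's actual code): Qwen2-style models use rotary positional embeddings (RoPE), where the query/key vectors are rotated by an angle proportional to the token position, so the same token produces different attention inputs at position 0 than at position 1.

```cpp
// Toy RoPE sketch (hypothetical 4-dim vector, not real model weights):
// the same vector rotated at position 0 vs. position 1 gives different results,
// so the downstream hidden states (and per-token embeddings) differ as well.
#include <cmath>
#include <cstdio>
#include <vector>

// Rotate consecutive pairs (x[2k], x[2k+1]) by angle pos * theta_k,
// with theta_k = base^(-2k/d), the standard RoPE frequencies.
static std::vector<float> apply_rope(const std::vector<float> & x, int pos, float base = 10000.0f) {
    const size_t d = x.size();
    std::vector<float> out(d);
    for (size_t i = 0; i + 1 < d; i += 2) {
        const float theta = pos * std::pow(base, -(float) i / (float) d);
        const float c = std::cos(theta), s = std::sin(theta);
        out[i]     = x[i] * c - x[i + 1] * s;
        out[i + 1] = x[i] * s + x[i + 1] * c;
    }
    return out;
}

int main() {
    const std::vector<float> x = {0.5f, -0.25f, 0.75f, 0.1f}; // toy vector
    const auto at_pos0 = apply_rope(x, 0); // identical to x (angle 0)
    const auto at_pos1 = apply_rope(x, 1); // rotated, differs from x
    for (size_t i = 0; i < x.size(); ++i) {
        std::printf("dim %zu: pos0 = %+.4f  pos1 = %+.4f\n", i, at_pos0[i], at_pos1[i]);
    }
    return 0;
}
```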
-
There are two problems here. This model uses the last-token pooling type, so in the Test3 case it should output only one embedding after this fix:
And the second problem is that the
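For clarity, a minimal sketch of what last-token pooling means here (my own illustration, not the actual llama.cpp implementation): instead of one embedding per token, only the hidden state of the final token is returned as the sequence embedding, which is why Test3 should print a single embedding.

```cpp
// Hypothetical sketch of last-token pooling: given per-token hidden states,
// the sequence embedding is just the hidden state of the last token.
#include <vector>

using Embedding = std::vector<float>;

// token_states: one hidden-state vector per prompt token.
// Returns a single embedding for the whole sequence.
Embedding pool_last_token(const std::vector<Embedding> & token_states) {
    return token_states.back(); // keep only the final token's state
}
```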
-
I tried the following 3 test cases, but the result is confusing.
Test1: create embeddings for "Hello" (don't add special/BOS):
llama-embedding.exe -m gte-qwen2-7b-instruct-f16.gguf -e -p "Hello" --verbose-prompt -ngl 0 --batch-size 4096
and get the output:
embedding 0: [-0.010509 -0.007925 -0.006991 ... -0.010548 -0.014585 0.018345 ]
Test2: create embeddings for "test" (don't add special/BOS):
llama-embedding.exe -m gte-qwen2-7b-instruct-f16.gguf -e -p "test" --verbose-prompt -ngl 0 --batch-size 4096
and get the output:
embedding 0: [-0.000707 0.007401 0.001886 ... -0.014110 -0.003793 0.016024 ]
Test3: create embeddings for "Hello test" (don't add special/BOS):
llama-embedding.exe -m gte-qwen2-7b-instruct-f16.gguf -e -p "Hello test" --verbose-prompt -ngl 0 --batch-size 4096
and get the outputs:
Test3's embedding 0 is consistent with Test1's embedding 0, but Test3's embedding 1 is not consistent with Test2's embedding 0.
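For reference, a small sketch (my own helper, not an official tool; the placeholder values below are just the truncated numbers printed above and would need the full vectors pasted in) to quantify how close two embedding vectors are via cosine similarity, which makes "consistent" vs. "not consistent" easier to judge than eyeballing printed values:

```cpp
// Cosine similarity between two embedding vectors of equal length.
// Values near 1.0 mean the embeddings point in essentially the same direction.
#include <cmath>
#include <cstdio>
#include <vector>

static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
}

int main() {
    // Placeholders: paste the full vectors printed by llama-embedding here.
    const std::vector<float> test2_emb0 = {-0.000707f, 0.007401f, 0.001886f /* , ... */};
    const std::vector<float> test3_emb1 = {/* embedding 1 from Test3 */ 0.0f, 0.0f, 0.0f};
    std::printf("cosine similarity: %f\n", cosine_similarity(test2_emb0, test3_emb1));
    return 0;
}
```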