What text encoding does `llama_token_to_piece()` return? UTF-8? #3114

crasm · 2023-09-10T23:17:04Z


The sentencepiece README states that it normalizes via NFKC. The tokenizer.json file in e.g. jondurbin_airoboros-l2-70b-gpt4-1.4.1 is in UTF-8. I'm not sure how to inspect the tokenizer.model file.

From the perspective of somebody just using llama_token_to_piece(), how do I know what format of text I am getting back from llama.cpp? Would this be dependent on the model's vocab? I have been assuming UTF-8 and it has been working.

Answered by goerch


misutoneko · 2023-09-12T13:59:53Z


This script can convert tokenizer.model to text-based files (vocab.json and merges.txt):
https://github.com/huggingface/tokenizers/blob/main/bindings/python/scripts/sentencepiece_extractor.py


goerch · 2023-09-19T17:28:09Z
Collaborator

llama.cpp doesn't implement any kind of Unicode normalization, so your output depends on the normalization of your input. And I would expect llama_token_to_piece to return UTF-8, yes.
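
To make the normalization point concrete, here is a small sketch in plain Python (it does not call llama.cpp; the sample strings are only illustrative). It shows that the same visible text can be different code points, and therefore different UTF-8 bytes, depending on normalization. If nothing normalizes the input, those differences carry through to whatever is tokenized.

```python
import unicodedata

# The same visible character "é", written two ways.
decomposed = "e\u0301"                               # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# Different code points mean different UTF-8 bytes.
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'
print(composed.encode("utf-8"))    # b'\xc3\xa9'

# NFKC (the form the sentencepiece README mentions) also folds
# compatibility characters, e.g. the "fi" ligature U+FB01 becomes "fi".
ligature = "\ufb01"
folded = unicodedata.normalize("NFKC", ligature)
print(folded)                      # fi
```

So if you need a particular normalization form, it is safest to apply it yourself before tokenizing.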

2 replies

marcov-dart Oct 10, 2023

I was using a Llama 2 chat 7B model and have seen llama_token_to_piece emit a UTF-8 sequence, but it was outputting the sequence byte by byte:
Token 243 -> byte 240
Token 162 -> byte 159
Token 158 -> byte 155
Token 143 -> byte 140
Forming codepoint 0x1f6cc which is 🛌

Token 243 -> byte 240
Token 162 -> byte 159
Token 155 -> byte 152
Token 141 -> byte 138
Forming codepoint 0x1f60a which is 😊
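
As a sanity check on the 😊 example, here is a minimal Python sketch that rebuilds the code point from the four reported tokens. The offset of 3 between token id and byte value is an assumption read off the token -> byte pairs above for this particular vocab; `BYTE_TOKEN_OFFSET` is a made-up name, not a llama.cpp constant.

```python
# Byte-fallback token ids reported above for the 😊 example.
token_ids = [243, 162, 155, 141]

# Assumption: for this vocab, byte tokens sit at id = byte value + 3,
# as the token -> byte pairs in the post suggest.
BYTE_TOKEN_OFFSET = 3

raw = bytes(t - BYTE_TOKEN_OFFSET for t in token_ids)
text = raw.decode("utf-8")

print(list(raw))             # [240, 159, 152, 138]
print(hex(ord(text)), text)  # 0x1f60a 😊

# A single byte of the sequence is not valid UTF-8 on its own.
try:
    raw[:1].decode("utf-8")
    first_byte_alone_ok = True
except UnicodeDecodeError:
    first_byte_alone_ok = False
print(first_byte_alone_ok)   # False
```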

KerfuffleV2 Oct 10, 2023
Collaborator

You're completely right: the content of individual tokens doesn't have to be valid UTF-8. There isn't even a guarantee that the sequence of tokens the model generates will eventually form valid UTF-8.
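
One way a caller might cope with this is to buffer bytes until they form a complete character, and to substitute a replacement character if they never do. This is a hedged sketch in Python using the standard incremental decoder. It assumes the pieces arrive as `bytes` objects (as a binding around llama_token_to_piece might hand them over), and `stream_text` is a hypothetical helper, not part of llama.cpp.

```python
import codecs

def stream_text(pieces, errors="replace"):
    """Yield printable text from an iterable of raw token pieces (bytes).

    Incomplete multi-byte sequences are held back until the remaining
    bytes arrive; invalid or truncated input is handled per `errors`.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    for piece in pieces:
        text = decoder.decode(piece)
        if text:
            yield text
    # Flush: anything still buffered here never became a full character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# The four bytes of 😊 arrive as separate pieces, as in the report above.
complete = list(stream_text([b"Hi ", b"\xf0", b"\x9f", b"\x98", b"\x8a", b"!"]))
print(complete)  # ['Hi ', '😊', '!']

# Generation stops mid-character: the leftover bytes cannot be decoded,
# so with errors="replace" they show up as U+FFFD instead of raising.
truncated = "".join(stream_text([b"ok", b"\xf0", b"\x9f"]))
```

Passing `errors="strict"` instead would raise `UnicodeDecodeError` on the truncated stream, which may be preferable if silent replacement is not acceptable.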
