4-byte UTF-16 chars become invalid tokens #33

@lewisbobrow

The code uses GetStringUTFChars, which converts Java Strings (UTF-16) to "Modified UTF-8" before encoding. In Modified UTF-8, a supplementary character (stored as a surrogate pair in UTF-16 and as a single 4-byte sequence in standard UTF-8) is instead written as two 3-byte sequences, one per surrogate. SentencePiece can't understand "Modified UTF-8" encoding, so it returns the space marker (▁) followed by the UNK id. We need to convert Java Strings to proper UTF-8 following the techniques described here: https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
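To make the difference concrete, here is a minimal Java sketch comparing the two encodings for one of the characters in the example below. It is not this project's code: the class name Utf8Demo and its helper methods are hypothetical, and the JDK's DataOutputStream.writeUTF is used only as a stand-in for GetStringUTFChars, on the assumption that both produce the same modified UTF-8 bytes for supplementary characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class Utf8Demo {
    // Standard UTF-8: a supplementary code point becomes one 4-byte sequence.
    static byte[] toStandardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 via DataOutputStream.writeUTF (stand-in for GetStringUTFChars).
    // writeUTF prepends a 2-byte length, which is stripped here.
    static byte[] toModifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buf.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Render bytes as space-separated hex for printing.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D7F2 MATHEMATICAL SANS-SERIF BOLD DIGIT SIX
        String six = new String(Character.toChars(0x1D7F2));
        System.out.println("standard: " + hex(toStandardUtf8(six))); // F0 9D 9F B2
        System.out.println("modified: " + hex(toModifiedUtf8(six))); // ED A0 B5 ED BF B2
    }
}
```

One possible fix along these lines is to call getBytes(StandardCharsets.UTF_8) on the Java side and hand the resulting byte[] to the native method as a jbyteArray, so that GetStringUTFChars is never involved.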

Encoding with SentencePiece CLI:

printf '𝟲𝟬 𝗪𝗮𝘁𝘁 light bulbs' | spm_encode --model=my_model.model --output_format=id
// 729 9294 445 14785

Encoding with JNI:

println(sentencePieceProcessor.encodeAsIds("𝟲𝟬 𝗪𝗮𝘁𝘁 light bulbs"))
// 298988, 0, 298988, 0, 445, 14785
