Description
The code uses GetStringUTFChars, which converts Java Strings (UTF-16) to "Modified UTF-8" before encoding. In Modified UTF-8, any supplementary character (a surrogate pair in UTF-16, a single 4-byte sequence in standard UTF-8) is written as two 3-byte sequences, one per surrogate. SentencePiece does not understand Modified UTF-8, so it returns the space marker (▁) followed by the UNK ID. We need to convert Java Strings to proper UTF-8 using the techniques described here: https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
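To make the byte-level difference concrete, here is a minimal Java sketch that encodes the first character of the example string both ways. It does not touch JNI. It relies on `DataOutputStream.writeUTF`, which is documented to write the same Modified UTF-8 format, as a stand-in for `GetStringUTFChars`. The class and method names (`ModifiedUtf8Demo`, `standardUtf8`, `modifiedUtf8`) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Standard UTF-8: a supplementary code point becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF, used here as a
    // stand-in for what GetStringUTFChars hands to native code.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withPrefix = buf.toByteArray();
        // writeUTF prepends a 2-byte length; drop it to keep only the payload.
        return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+1D7F2 (the bold digit six from the example), written as its surrogate pair.
        String six = "\uD835\uDFF2";
        System.out.println("standard UTF-8: " + hex(standardUtf8(six)));
        System.out.println("modified UTF-8: " + hex(modifiedUtf8(six)));
    }
}
```

The standard encoding is 4 bytes long and the modified one is 6 bytes long, so a tokenizer expecting standard UTF-8 sees two invalid 3-byte sequences instead of one character.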
Encoding with SentencePiece CLI:
printf '𝟲𝟬 𝗪𝗮𝘁𝘁 light bulbs' | spm_encode --model=my_model.model --output_format=id
// 729 9294 445 14785
Encoding with JNI:
println(sentencePieceProcessor.encodeAsIds("𝟲𝟬 𝗪𝗮𝘁𝘁 light bulbs"))
// 298988, 0, 298988, 0, 445, 14785
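One possible shape for the fix, sketched on the Java side only: convert the String to standard UTF-8 with `String.getBytes(StandardCharsets.UTF_8)` and hand the resulting `byte[]` to native code, so that `GetStringUTFChars` is never involved. The `NativeEncoder` interface below is hypothetical and stands in for a native method that would accept a byte array. It is not part of the current bindings, and the native side would need a matching change.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bridge {

    // Hypothetical native entry point that takes raw UTF-8 bytes instead of a
    // jstring. Assumed for illustration; the real binding may look different.
    interface NativeEncoder {
        int[] encodeAsIds(byte[] utf8);
    }

    // Convert in Java, where getBytes produces standard (not modified) UTF-8,
    // then pass the bytes through unchanged.
    static int[] encodeAsIds(NativeEncoder encoder, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return encoder.encodeAsIds(utf8);
    }

    public static void main(String[] args) {
        // Stand-in encoder that only reports how many bytes it received.
        NativeEncoder byteCounter = utf8 -> new int[] { utf8.length };

        // U+1D7F2 and U+1D7EC as surrogate pairs, followed by " light".
        String text = "\uD835\uDFF2\uD835\uDFEC light";
        System.out.println(encodeAsIds(byteCounter, text)[0]);
    }
}
```

With this approach each supplementary character reaches the native side as a single 4-byte sequence, which is the form `spm_encode` receives from the shell in the CLI example.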