4-byte UTF-16 chars become invalid tokens

The [code](https://github.com/levyfan/sentencepiece-jni/blob/db49073b885b3370406f45a78d06fe217b02d2d2/src/main/native/com_github_google_sentencepiece_SentencePieceJNI.cc#L275C1-L275C62) uses `GetStringUTFChars`, which converts Java Strings (UTF-16) to "Modified UTF-8" prior to encoding. In Modified UTF-8, any supplementary character (one stored as a surrogate pair, i.e. 4 bytes in UTF-16) is written as two 3-byte sequences, one per surrogate, instead of the single 4-byte sequence that standard UTF-8 uses. SentencePiece can't understand "Modified UTF-8" encoding, so it returns the space marker (▁) followed by the UNK id. We need to convert Java Strings to proper UTF-8, following the techniques described here: https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
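To make the byte-level difference concrete, here is a minimal Java sketch that encodes one supplementary character both ways. The class and method names (`Utf8Comparison`, `standardUtf8`, `modifiedUtf8`, `hex`) are made up for illustration and are not part of sentencepiece-jni. It uses `DataOutputStream.writeUTF` as a stand-in for the Modified UTF-8 that `GetStringUTFChars` produces, and assumes U+1D7F2 is representative of the characters in the example below.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Comparison {
    // Standard UTF-8: a supplementary code point becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF: each surrogate
    // of the pair is encoded on its own as a 3-byte sequence.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        // writeUTF prepends a 2-byte length; drop it to keep only the text bytes.
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+1D7F2 (MATHEMATICAL SANS-SERIF BOLD DIGIT SIX) as a surrogate pair.
        String digit = "\uD835\uDFF2";
        System.out.println("standard: " + hex(standardUtf8(digit))); // F0 9D 9F B2
        System.out.println("modified: " + hex(modifiedUtf8(digit))); // ED A0 B5 ED BF B2
    }
}
```

One possible fix along the lines of the linked answer is to do the conversion on the Java side with `getBytes(StandardCharsets.UTF_8)` and pass the resulting `byte[]` to the native code, so that `GetStringUTFChars` is never involved. How that would fit the existing JNI signatures is not shown here.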

Encoding with SentencePiece CLI:
```
printf '𝟲𝟬 𝗪𝗮𝘁𝘁 light bulbs' | spm_encode --model=my_model.model --output_format=id
// 729 9294 445 14785
```

Encoding with JNI:
```
println(sentencePieceProcessor.encodeAsIds("𝟲𝟬 𝗪𝗮𝘁𝘁 light bulbs"))
// 298988, 0, 298988, 0, 445, 14785
```

