See how Charabia uses it:

- https://github.com/meilisearch/charabia/blob/main/src/segmenter/thai.rs
- https://github.com/meilisearch/charabia/blob/main/src/segmenter/utils.rs

They use the longest-matching approach. And I'm sure their segmenter doesn't check for valid boundaries, such as Thai Character Cluster (TCC) boundaries, so there is room for improvement.
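
To make the concern concrete, here is a minimal sketch of dictionary-based longest matching. It is not Charabia's code: the function name `longest_match_segment`, the toy dictionary, and the single-character fallback for unknown text are all assumptions made for illustration.

```python
def longest_match_segment(text, dictionary):
    """Greedy longest-matching segmentation (illustrative sketch only).

    At each position, take the longest dictionary word that starts there.
    If nothing matches, emit a single code point (an assumed fallback).
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # No cluster check here: this can cut between a consonant
            # and the vowel sign or tone mark attached to it.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, not taken from any real word list.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}

ambiguous = longest_match_segment("ตากลม", toy_dictionary)
unknown = longest_match_segment("กิน", toy_dictionary)
print(ambiguous)
print(unknown)
```

With this toy dictionary, "ตากลม" is split greedily as "ตาก" + "ลม" rather than "ตา" + "กลม", and the out-of-dictionary word "กิน" falls apart into three code points, with the vowel sign separated from its consonant. A check against Thai Character Cluster boundaries would rule out that second kind of split.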