Hi!
We are trying to integrate Charabia in here: qdrant/qdrant#2260
Our main concern is binary size, which is why we are trying to use it with the dictionaries for Japanese, Korean and Chinese disabled.
Version 7.2 seemed to split text per character by default in this case:
本日の日付は
-> ["本", "日", "の", "日", "付", "は"]
which was fine for our purposes. The new version, however, no longer does that:
本日の日付は
-> ["本日の日付は"]
Is this an intended behavior change, and is it possible to configure the segmenter to behave the way it did before?
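
For reference, the behavior we relied on is roughly what the sketch below does. It is only an illustration written against the standard library, not Charabia's API or its actual implementation: the names `is_cjk` and `split_cjk_per_char` are hypothetical, and the list of Unicode ranges is deliberately incomplete.

```rust
/// Returns true for characters in a few common CJK blocks.
/// Assumption: this range list is illustrative and far from exhaustive.
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x309F // Hiragana
        | 0x30A0..=0x30FF // Katakana
        | 0x4E00..=0x9FFF // CJK Unified Ideographs
        | 0xAC00..=0xD7AF // Hangul Syllables
    )
}

/// Emits every CJK character as its own token. Runs of other
/// non-whitespace characters are kept together, and whitespace
/// only acts as a separator.
fn split_cjk_per_char(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if is_cjk(c) || c.is_whitespace() {
            // A CJK character or whitespace ends the pending non-CJK run.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            if is_cjk(c) {
                tokens.push(c.to_string());
            }
        } else {
            current.push(c);
        }
    }
    // Flush a trailing non-CJK run, if any.
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

With this sketch, `split_cjk_per_char("本日の日付は")` yields the six single-character tokens shown above. We could apply something like this on our side as a fallback, but we would prefer a supported option in the segmenter if one exists.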