
Tokenization of Japanese text with disabled default features #229

@generall

Description


Hi!

We are trying to integrate Charabia here: qdrant/qdrant#2260
Our main concern is binary size, which is why we are using Charabia with the dictionaries for Japanese, Korean, and Chinese disabled.
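
For reference, this is the kind of dependency declaration we mean; a minimal sketch assuming the CJK dictionaries sit behind Charabia's default Cargo features (the version number is illustrative):

```toml
# Sketch of a Cargo.toml fragment: depend on charabia with its default
# features (which include the CJK segmentation dictionaries) turned off,
# to keep the binary small. The version pin here is illustrative.
[dependencies]
charabia = { version = "0.7", default-features = false }
```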

Version 7.2 appeared to fall back to splitting the text per character in this case:

本日の日付は -> ["本", "日", "の", "日", "付", "は"]

which was fine for our purposes. The new version, however, no longer does this:

本日の日付は -> ["本日の日付は"]

Is this an intended behavior change, and is it possible to configure the segmenter to behave the way it did before?
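
For context, a minimal sketch of how we exercise the segmenter, using the `Segment` trait shown in Charabia's README; the commented outputs are the two behaviors described above:

```rust
use charabia::Segment;

fn main() {
    let text = "本日の日付は";
    // Collect the raw segments produced by the segmenter.
    let segments: Vec<&str> = text.segment_str().collect();
    println!("{:?}", segments);
    // 7.2 with dictionaries disabled: ["本", "日", "の", "日", "付", "は"]
    // new version:                    ["本日の日付は"]
}
```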
