
Tokenization of Japanese text with disabled default features #229

@generall

Description


Hi!

We are trying to integrate Charabia here: qdrant/qdrant#2260
Our main concern is binary size, which is why we are using Charabia with the dictionaries for Japanese, Korean, and Chinese disabled.
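
For reference, this is the kind of dependency declaration we mean; a minimal sketch assuming the CJK dictionaries sit behind Charabia's default Cargo features (the version number is illustrative):

```toml
# Sketch of a Cargo.toml fragment: depend on charabia with its default
# features (which include the CJK segmentation dictionaries) turned off,
# to keep the binary small. The version pin here is illustrative.
[dependencies]
charabia = { version = "0.7", default-features = false }
```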

Version 7.2 appeared to fall back to splitting the text per character in this case:

本日の日付は -> ["本", "日", "の", "日", "付", "は"]

which was fine for our purposes. The new version, however, no longer does this:

本日の日付は -> ["本日の日付は"]

Is this an intended behavior change, and is it possible to configure the segmenter to behave the way it did before?
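
For context, a minimal sketch of how we exercise the segmenter, using the `Segment` trait shown in Charabia's README; the commented outputs are the two behaviors described above:

```rust
use charabia::Segment;

fn main() {
    let text = "本日の日付は";
    // Collect the raw segments produced by the segmenter.
    let segments: Vec<&str> = text.segment_str().collect();
    println!("{:?}", segments);
    // 7.2 with dictionaries disabled: ["本", "日", "の", "日", "付", "は"]
    // new version:                    ["本日の日付は"]
}
```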
