Japanese textsplitter behavior #30784

SwHaraday · 2025-04-11T05:45:19Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

from typing import Any

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

class JapaneseCharacterTextSplitter(RecursiveCharacterTextSplitter):

    def __init__(self, **kwargs: Any):
        separators = ["\n\n", "\n", "。"]
        super().__init__(separators=separators, **kwargs)

Description

Hi folks!

There's something I don't understand about how the TextSplitter behaves when used with FAISS from langchain_community.
Could someone please help?

The target is Japanese text.

Based on RecursiveCharacterTextSplitter, I created the following class:

class JapaneseCharacterTextSplitter(RecursiveCharacterTextSplitter):

    def __init__(self, **kwargs: Any):
        separators = ["\n\n", "\n", "。"]
        super().__init__(separators=separators, **kwargs)

When I test this, the text is not split at a "。" that appears just before the specified chunk size. Instead, it is cut in the middle of a sentence, at roughly the next chunk-size boundary.

The expected behavior is:

Do not cut in the middle of a sentence.
If there is a "。", cut at that position even if the resulting chunk is smaller than the chunk size.

Please advise if there is a way to get the expected behavior.
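
For illustration, the expected behavior can be sketched in plain Python without LangChain. This is only a minimal sketch: `split_sentences` is a hypothetical helper name, and it assumes that "。" is the only sentence terminator that matters.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after every "。" so that no chunk ends mid-sentence.

    Hypothetical helper for illustration; not part of LangChain.
    """
    # The lookbehind matches the empty position right after each "。",
    # so the separator stays attached to the sentence it closes.
    parts = re.split(r"(?<=。)", text)
    # A text ending in "。" yields a trailing empty string; drop it.
    return [p for p in parts if p]


print(split_sentences("これはテストです。次の文です。最後の文"))
# ['これはテストです。', '次の文です。', '最後の文']
```

Splitting on an empty match requires Python 3.7 or later. Note that this sketch ignores the chunk size entirely, which is the main difference from what a TextSplitter normally does.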

Environment: Python 3.10.16 on conda for Windows 11 Pro
faiss 1.10.0
langchain 0.3.20
langchain_community 0.3.19

I have read #14348, but there seems to be no specific workaround there.

Thanks in advance.

System Info

Environment: Python 3.10.16 on conda for Windows 11 Pro
faiss 1.10.0
langchain 0.3.20
langchain_community 0.3.19

Answered by mghiasvand1

Apr 14, 2025


mghiasvand1 · 2025-04-13T07:47:03Z

Hi, to get the behavior you expect, you should implement a custom "split_text" method. The code below, run with the provided test case, works as follows. The separator takes priority when splitting: if a separator occurs before the chunk size is reached, the text is split there. Otherwise, the text is split at the earliest separator after the chunk size, to avoid splitting in the middle of a sentence. If there are no separators, the text is split at the chunk size.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import Any, List

class JapaneseCharacterTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any):
        self.separators = ["\n\n", "\n", "。"]
        self.chunk_size = kwargs.get("chunk_size", 8)
        super().__init__(separators=self.separators, **kwargs)

    def split_text(self, text: str) -> List[str]:
        chunks = []
        while text:
            if len(text) <= self.chunk_size:
                chunks.append(text)
                break

            split_index = None
            for sep in self.separators:
                pos = text.find(sep)
                if pos != -1 and pos + len(sep) <= self.chunk_size:
                    if split_index is None or pos < split_index:
                        split_index = pos + len(sep)
            
            if split_index is None:
                split_index = len(text)  
                for sep in self.separators:
                    pos = text.find(sep, self.chunk_size)
                    if pos != -1 and pos < split_index:
                        split_index = pos + len(sep)
                
                if split_index == len(text):
                    split_index = self.chunk_size

            chunks.append(text[:split_index])
            text = text[split_index:]
        return chunks

txt = "これはテす。次の文ですさらに別。の文ですさらの文ですさらの文"
splitter = JapaneseCharacterTextSplitter(chunk_size=8, chunk_overlap=0)
chunks = splitter.split_text(txt)
print(chunks)

3 replies

SwHaraday Apr 14, 2025
Author

Hi mghiasvand1,
Thank you for your kind advice.
I tested it, and the result was not as expected.

I changed

        separators = ["\n\n", "\n", "。"]

to

        separators = ["。"]

and called it like this:

split_texts = loader.load_and_split(
    text_splitter=JapaneseCharacterTextSplitter(
        chunk_size=30,
        chunk_overlap=0,
        keep_separator='end',
    )
)

The result is almost what I expected.
The text was not split inside a sentence; each chunk runs until a '。' is found, even if its length exceeds the chunk size.
However, I am still confused by the unintended '\n' characters that appear in the resulting chunks.

If I find a workaround, I will post it here.

BR, Yuji

mghiasvand1 Apr 14, 2025

@SwHaraday, you are right: there was a small bug in the code. I have fixed it and made a few other small changes. Please disregard my previous comment and run the following code instead:

from typing import Any, List

from langchain.text_splitter import RecursiveCharacterTextSplitter

class JapaneseCharacterTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any):
        self.separators = ["。"]
        # chunk_size must be passed explicitly; it is None otherwise.
        self.chunk_size = kwargs.get("chunk_size")
        super().__init__(separators=self.separators, **kwargs)

    def split_text(self, text: str) -> List[str]:
        chunks = []
        while text:
            # First choice: the earliest separator that ends within chunk_size.
            split_index = None
            for sep in self.separators:
                pos = text.find(sep)
                if pos != -1 and pos + len(sep) <= self.chunk_size:
                    if split_index is None or pos + len(sep) < split_index:
                        split_index = pos + len(sep)

            if split_index is None:
                # Otherwise: the earliest separator after chunk_size, so that
                # a long sentence is kept whole rather than cut in the middle.
                for sep in self.separators:
                    pos = text.find(sep, self.chunk_size)
                    if pos != -1 and (split_index is None or pos + len(sep) < split_index):
                        split_index = pos + len(sep)

                # No separator left at all: fall back to a hard cut at chunk_size.
                if split_index is None:
                    split_index = self.chunk_size

            chunks.append(text[:split_index])
            text = text[split_index:]

        return chunks

class MockLoader:
    def __init__(self, text: str):
        self.text = text

    def load_and_split(self, text_splitter):
        return text_splitter.split_text(self.text)

japanese_text = "これはテす。次の文ですさらに別。の文ですさらの文ですさらの文文でらの文ですさらの文文ですさらの。文です"
loader = MockLoader(japanese_text)
split_texts = loader.load_and_split(
    text_splitter=JapaneseCharacterTextSplitter(
        chunk_size=30,
        chunk_overlap=0,
        keep_separator='end',
    )
)

print(split_texts)

Running it prints: ['これはテす。', '次の文ですさらに別。', 'の文ですさらの文ですさらの文文でらの文ですさらの文文ですさらの。', '文です']. In other words, the separator takes priority when splitting. If a separator occurs before the chunk size is reached, the text is split there. Otherwise, the text is split at the earliest separator after the chunk size, which avoids cutting in the middle of a sentence. If no separator remains, the text is split at the chunk size.
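
As a side note, for a single one-character separator such as "。", the first two rules appear to collapse into one: always cut after the next separator, whether it falls before or after the chunk size. The sketch below shows that reduced form as a standalone function. `split_on_separator` is a hypothetical name used only for illustration, not a LangChain API, and the sketch assumes exactly one separator.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, sep: str = "。") -> List[str]:
    """Cut after each `sep`; hard-cut at `chunk_size` only when none is left.

    Hypothetical standalone sketch of the rules described in this thread.
    """
    chunks: List[str] = []
    while text:
        pos = text.find(sep)
        if pos != -1:
            # A separator exists: keep the whole sentence, even if it is
            # longer than chunk_size.
            end = pos + len(sep)
        else:
            # No separator remains: fall back to fixed-size pieces.
            end = chunk_size
        chunks.append(text[:end])
        text = text[end:]
    return chunks


print(split_on_separator("短い。とても長い文ですが切りません。区切りなし", chunk_size=8))
# ['短い。', 'とても長い文ですが切りません。', '区切りなし']
```

In this form `chunk_size` only matters for text that contains no separator at all, so it acts as a fallback rather than as an upper bound on chunk length.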

Please mark this answer as the accepted one if this comment has resolved your question.

Answer selected by SwHaraday

SwHaraday Apr 14, 2025
Author

Hi mghiasvand1,

I tested with

japanese_text = "私は鳥が好きです。\n\n鳥が好きと言っても食べ物として好きと言うことではありません。鳥と言う生き物の姿かたち、その行動などに愛着を感じていると言ってよいのかもしれません。\n\n小さいころ実家の一軒家には小さな庭がありました。我々兄弟が大きくなり、祖父が長期の入院を終えて一緒に暮らすことになり建て増しをしましたが、それ以前の庭は結構な大きさがありました。昭和40年代は小学校の校門の外によくヒヨコを売る人が現れました。ヒヨコたちはすべてオスで卵を産まないため役立たずを少しでもお金に換えるために子ども相手の売り物としていたのでしょう。我が家は裕福ではなかったので親に頼んでも買ってもらえることは稀でした。買ってもらっても大概はすぐに死なせてしまうと言うのがいつものパターンでした。ある時、赤や緑に染められたヒヨコが売られていたことがあり兄と私に一羽ずつ買ってもらったことがありました。一羽は脚気で直ぐに死んでしまいましたが、もう一羽は元気にスクスク大きくなり庭で放し飼いが出来るまでになりました。\n何故か分かりませんが、そのニワトリは父親に良くなつき父が会社に行くときには車道に出るまでの私道を父について歩いていき、父が車道の先へと言ってしまうと庭に戻ってくると言う何ともいじらしいことを毎日のようにしていました。\n朝早くから大きな声で「コケコッコー」と鳴きまくり近所迷惑となってしまった彼は、父の会社の友人に引き取られることとなりました。\n\nニワトリ以外でもセキセイインコやジュウシマツをよく飼っていました。母親は子どものころ犬を飼っていて、その犬が役所の人に連れ去られて処分されてしまったことがショックでそれ以来犬は飼いたくなくなった様です。\n\n私の持った所帯も犬猫を飼うことが出来ないマンションですが、子どもたちに命の大切さを教えるためもありセキセイインコを一羽飼いしました。丈夫な女の子で13年近く一緒に暮らしましたが、人間で言えば90歳を超えており天寿を全うして逝ってしまいました。私の代ではもう鳥を飼うことは無いでしょう。"

and the result is very good, as shown below.

['私は鳥が好きです。', '\n\n鳥が好きと言っても食べ物として好きと言うことではありません。', '鳥と言う生き物の姿かたち、その行動などに愛着を感じていると言ってよいのかもしれません。', '\n\n小さいころ実家の一軒家には小さな庭がありました。', '我々兄弟が大きくなり、祖父が長期の入院を終えて一緒に暮らすことになり建て増しをしましたが、それ以前の庭は結構な大きさがありました。', '昭和40年代は小学校の校門の外によくヒヨコを売る人が現れました。', 'ヒヨコたちはすべてオスで卵を産まないため役立たずを少しでもお金に換えるために子ども相手の売り物としていたのでしょう。', '我が家は裕福ではなかったので親に頼んでも買ってもらえることは稀でした。', '買ってもらっても大概はすぐに死なせてしまうと言うのがいつものパターンでした。', 'ある時、赤や緑に染められたヒヨコが売られていたことがあり兄と私に一羽ずつ買ってもらったことがありました。', '一羽は脚気で直ぐに死んでしまいましたが、もう一羽は元気にスクスク大きくなり庭で放し飼いが出来るまでになりました。', '\n何故か分かりませんが、そのニワトリは父親に良くなつき父が会社に行くときには車道に出るまでの私道を父について歩いていき、父が車道の先へと言ってしまうと庭に戻ってくると言う何ともいじらしいことを毎日のようにしていました。', '\n朝早くから大きな声で「コケコッコー」と鳴きまくり近所迷惑となってしまった彼は、父の会社の友人に引き取られることとなりました。', '\n\nニワトリ以外でもセキセイインコやジュウシマツをよく飼っていました。', '母親は子どものころ犬を飼っていて、その犬が役所の人に連れ去られて処分されてしまったことがショックでそれ以来犬は飼いたくなくなった様です。', '\n\n私の持った所帯も犬猫を飼うことが出来ないマンションですが、子どもたちに命の大切さを教えるためもありセキセイインコを一羽飼いしました。', '丈夫な女の子で13年近く一緒に暮らしましたが、人間で言えば90歳を超えており天寿を全うして逝ってしまいました。', '私の代ではもう鳥を飼うことは無いでしょう。']
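
Several chunks above still begin with the '\n' characters from the source text, because "。" is the only separator and the newlines stay attached to the following sentence. If they are unwanted, one possible approach is to post-process the chunks. This is a minimal sketch; `clean_chunks` is a hypothetical helper name, not part of LangChain.

```python
from typing import List


def clean_chunks(chunks: List[str]) -> List[str]:
    """Strip surrounding whitespace from each chunk and drop empty ones.

    Hypothetical post-processing helper for illustration.
    """
    stripped = (chunk.strip() for chunk in chunks)
    # Chunks that contained only whitespace become empty and are removed.
    return [chunk for chunk in stripped if chunk]


print(clean_chunks(["私は鳥が好きです。", "\n\n小さいころ実家の一軒家には小さな庭がありました。", "\n"]))
# ['私は鳥が好きです。', '小さいころ実家の一軒家には小さな庭がありました。']
```

This discards the paragraph breaks, so it only fits cases where the chunk text matters and the original layout does not.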

Your advice inspired me a lot.
I appreciate your cooperation.

BR Yuji
