Japanese textsplitter behavior #30784
Example Code

from typing import Any

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS


class JapaneseCharacterTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any):
        separators = ["\n\n", "\n", "。"]
        super().__init__(separators=separators, **kwargs)

Description

Hi folks! There's something I don't understand about the TextSplitter behavior when used with FAISS from langchain_community. The target is Japanese text. Based on RecursiveCharacterTextSplitter, I created the JapaneseCharacterTextSplitter class shown in the example code above.

When testing this, even though there is a "。" just before the specified chunk size, the text is not split at that position. Instead, it is cut off in the middle of a sentence, at the point corresponding to the next chunk size. The expected behavior is:

Please advise if there is a way to get the expected behavior. I've read #14348, but there seems to be no specific workaround there. Thanks in advance.

Environment: Python 3.10.16 on conda for Windows 11 Pro

System Info

Environment: Python 3.10.16 on conda for Windows 11 Pro
Replies: 1 comment 3 replies
Hi, you should implement a custom split_text method to get the behavior you expect. I am sure the code below meets your expectations. If you run it with the provided test case, you will see that separators take priority when splitting: if a separator occurs within the chunk size, the text is split right after it. Otherwise, the text is split after the earliest separator beyond the chunk size, to avoid cutting in the middle of a sentence. If no separator is left, the text is split at the chunk size.

from typing import Any, List

from langchain.text_splitter import RecursiveCharacterTextSplitter


class JapaneseCharacterTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any):
        # Candidate separators: blank line, newline, Japanese full stop.
        self.separators = ["\n\n", "\n", "。"]
        # Fall back to 8 if chunk_size is not passed.
        self.chunk_size = kwargs.get("chunk_size", 8)
        super().__init__(separators=self.separators, **kwargs)

    def split_text(self, text: str) -> List[str]:
        chunks = []
        while text:
            # The remainder fits in one chunk: emit it and stop.
            if len(text) <= self.chunk_size:
                chunks.append(text)
                break
            # First, look for a separator that ends within chunk_size.
            split_index = None
            for sep in self.separators:
                pos = text.find(sep)
                if pos != -1 and pos + len(sep) <= self.chunk_size:
                    if split_index is None or pos < split_index:
                        split_index = pos + len(sep)
            if split_index is None:
                # Otherwise, look for a separator at or beyond chunk_size.
                split_index = len(text)
                for sep in self.separators:
                    pos = text.find(sep, self.chunk_size)
                    if pos != -1 and pos < split_index:
                        split_index = pos + len(sep)
                # Still at the end of the text: hard cut at chunk_size.
                if split_index == len(text):
                    split_index = self.chunk_size
            chunks.append(text[:split_index])
            text = text[split_index:]
        return chunks


txt = "これはテす。次の文ですさらに別。の文ですさらの文ですさらの文"
splitter = JapaneseCharacterTextSplitter(chunk_size=8, chunk_overlap=0)
chunks = splitter.split_text(txt)
print(chunks)
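For comparison, here is a minimal sketch of a stricter variant, assuming the goal is to cut at the last separator that still fits inside the chunk size, so that no chunk ever exceeds it. It is plain Python with no LangChain dependency. The function name split_at_last_separator and its parameters are hypothetical and are not part of any LangChain API.

```python
from typing import List, Sequence


def split_at_last_separator(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", "。"),
) -> List[str]:
    """Greedy split: cut after the last separator that fits in chunk_size.

    Hypothetical helper for illustration only; not a LangChain function.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    while len(text) > chunk_size:
        window = text[:chunk_size]
        # End index of the right-most separator fully inside the window
        # (stays 0 when no separator is found).
        cut = 0
        for sep in separators:
            pos = window.rfind(sep)
            if pos != -1:
                cut = max(cut, pos + len(sep))
        if cut == 0:
            # No separator fits in the window: hard cut at chunk_size.
            cut = chunk_size
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        # Whatever remains is at most chunk_size characters long.
        chunks.append(text)
    return chunks


txt = "これはテす。次の文ですさらに別。の文ですさらの文ですさらの文"
for chunk in split_at_last_separator(txt, chunk_size=8):
    print(len(chunk), chunk)
```

The trade-off differs from the class above: this sketch treats chunk_size as a hard limit and accepts a mid-sentence cut when no separator fits, whereas the class above lets a chunk grow past chunk_size to reach the next separator.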
@SwHaraday, you are right: there was a small bug in the code. I have fixed it and made a few other minor changes. Please disregard the previous comment and run the following code instead: