Japanese textsplitter behavior #30784

Closed Answered by mghiasvand1
SwHaraday asked this question in Q&A

@SwHaraday, you are right: there was a small bug in the code. I have fixed it and made a few minor changes. Please disregard my previous comment and run the following code instead:

from langchain.text_splitter import RecursiveCharacterTextSplitter

class JapaneseCharacterTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs):
        self.separators = ["。"]
        self.chunk_size = kwargs.get("chunk_size")
        super().__init__(separators=self.separators, **kwargs)

    def split_text(self, text: str):
        chunks = []
        while text:
            split_index = None
            for sep in self.separators:
                pos 
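
The `split_text` loop in the answer is cut off here, so the following is not the author's code. It is a minimal standalone sketch of the same idea (split on "。" and pack sentences into chunks no longer than `chunk_size`), written without LangChain so it can be run on its own. The function name `split_japanese_text`, the `separator` parameter, and the hard-cut fallback for over-long sentences are assumptions made for illustration.

```python
from typing import List


def split_japanese_text(text: str, chunk_size: int, separator: str = "。") -> List[str]:
    """Hypothetical helper: split text into chunks of at most chunk_size characters,
    breaking after the separator whenever possible."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not separator:
        raise ValueError("separator must be a non-empty string")

    # Cut the text into sentences, keeping the separator at the end of each one.
    sentences = []
    start = 0
    while start < len(text):
        pos = text.find(separator, start)
        end = len(text) if pos == -1 else pos + len(separator)
        sentences.append(text[start:end])
        start = end

    # Greedily pack whole sentences into chunks without exceeding chunk_size.
    chunks = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > chunk_size:
            chunks.append(current)
            current = ""
        # Assumption: a single sentence longer than chunk_size is hard-cut at the limit.
        while len(sentence) > chunk_size:
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        current += sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "今日は晴れです。明日は雨です。散歩に行きます。"
    for chunk in split_japanese_text(sample, chunk_size=16):
        print(chunk)
```

With `chunk_size=16`, the first two sentences (8 and 7 characters) fit into one chunk and the third starts a new one. Because every separator stays attached to its sentence, joining the chunks reproduces the input text exactly.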

Answer selected by SwHaraday