improvement for semantic chunking logic #30751

OnAnd0n · 2025-04-09T16:24:43Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

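# SemanticChunker_rev is my modified copy of SemanticChunker with the change described below;
# it is not part of the released langchain_experimental package.
from langchain_experimental.text_splitter import SemanticChunker, SemanticChunker_rev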
from langchain_huggingface import HuggingFaceEmbeddings


text="""where is a toilet?
embedding model is a bge-m3.
bge-m3 is trained for multi-task.
computer is broken, but soon repaired."""


embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")


text_splitter = SemanticChunker(
    embedding_model,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=30,
)

text_splitter.split_text(text)

## output:
## ['where is a toilet? embedding model is a bge-m3.',
##  'bge-m3 is trained for multi-task.',
##  'computer is broken, but soon repaired.']



embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")


text_splitter = SemanticChunker_rev(
    embedding_model,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=30,
)

text_splitter.split_text(text)
## output:
## ['where is a toilet?',
##  'embedding model is a bge-m3. bge-m3 is trained for multi-task.',
##  'computer is broken, but soon repaired.']
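
A note on why both runs return exactly three chunks: four sentences give only three gaps, and a 30th-percentile threshold over three distances lands between the smallest and the middle one, so only the gap with the smallest distance is merged. The sketch below reproduces both outputs with plain Python. The distance values are made up for illustration (they are not measured bge-m3 distances), and `percentile` and `split_at_breakpoints` are hypothetical helper names that only approximate what the splitter is assumed to do (a linear-interpolation percentile, then a split wherever a distance exceeds it).

```python
def percentile(values, q):
    """Percentile with linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_at_breakpoints(sentences, distances, q):
    """Split after sentence i whenever distances[i] exceeds the q-th percentile."""
    threshold = percentile(distances, q)
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


sentences = [
    "where is a toilet?",
    "embedding model is a bge-m3.",
    "bge-m3 is trained for multi-task.",
    "computer is broken, but soon repaired.",
]

# Made-up distances: the first gap is the smallest, so sentences 1 and 2 merge.
first_gap_smallest = split_at_breakpoints(sentences, [0.20, 0.35, 0.50], 30)

# Made-up distances: the middle gap is the smallest, so sentences 2 and 3 merge.
middle_gap_smallest = split_at_breakpoints(sentences, [0.50, 0.20, 0.35], 30)

print(first_gap_smallest)
print(middle_gap_smallest)
```

Under these assumptions, the only difference between the two outputs is which gap receives the smallest distance, so the way that first distance is computed decides the result.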

Description

To assist with the explanation, I have attached an illustration below.
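In LangChain's current logic, sentence 1 and sentence 2 are highly likely to be grouped together.
This is because the breakpoint between sentence1 and sentence2 is decided by the semantic textual similarity (STS) between combine_sentence1 and combine_sentence2, the combined (buffered) sentences built around each of them. These two combined sentences overlap heavily, so their similarity is high regardless of how different sentence1 and sentence2 actually are.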
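
The overlap can be sketched without any embedding model. The snippet below assumes that each combined sentence is the sentence itself plus one neighbour on each side (a buffer of 1, which is believed to be the splitter's default); `window_indices` is a hypothetical helper written for this illustration, not a library function.

```python
def window_indices(n_sentences, i, buffer_size=1):
    """Indices of the sentences that go into the combined sentence for position i."""
    start = max(0, i - buffer_size)
    end = min(n_sentences, i + buffer_size + 1)
    return list(range(start, end))


n_sentences = 4  # the four sentences of the example text
windows = [window_indices(n_sentences, i) for i in range(n_sentences)]

# For every gap, list which sentences appear in BOTH combined sentences being compared.
shared_per_gap = []
for i in range(n_sentences - 1):
    shared = sorted(set(windows[i]) & set(windows[i + 1]))
    shared_per_gap.append(shared)
    print(
        f"gap {i + 1}|{i + 2}: "
        f"left={[k + 1 for k in windows[i]]} "
        f"right={[k + 1 for k in windows[i + 1]]} "
        f"shared={[k + 1 for k in shared]}"
    )
```

For the first gap, the left combined sentence (sentences 1 and 2) is entirely contained in the right one (sentences 1, 2 and 3), so the comparison mostly measures how much sentence 3 adds, not how different sentence 1 is from sentence 2.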

By applying the modification shown below,
sentence1 can be separated in a more reasonable way without incurring additional cost.
I have implemented this logic as SemanticChunker_rev, and the results are shown in the 'Example Code' section above.

Would it be possible for me to open a PR for this change?

System Info

pip
Python 3.10.14
Windows


Replies: 0 comments
