This package contains an implementation of a dynamic sliding-window-based semantic chunking method, with a tunable parameter for the cosine similarity threshold, proposed in this research paper.
The method was primarily inspired by the percentile-based semantic chunking method presented by Greg Kamradt in the 5 Levels Of Text Splitting notebook.
I would also like to acknowledge the work of Brandon Smith and Anton Troynikov for the development of the Chunking Evaluation Framework for RAG systems, which was used during the method's development. In addition to the framework, their work presents an overall study of the quality of different chunking methods.
To install the package, run:
pip install "git+https://github.com/panalexeu/horchunk.git"
The process of method benchmarking is demonstrated in evaluation.ipynb.
The process of binary search-based method tuning on data is demonstrated in tuning.ipynb.
The usage example is also provided as notebook.
from chromadb.utils import embedding_functions
from horchunk.chunkers import WindowChunker
from horchunk.splitters import SentenceSplitter
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2", device='cuda')
chunker = WindowChunker(ef)
text = """Joshua Graham, also known as the Burned Man and formerly known as the Malpais Legate, is a Mormon missionary from the settlement of New Canaan and a co-founder of Caesar's Legion as well as its first legate.
After three decades of service in the Legion, helping Caesar carry out his conquest of the former state of Arizona and much of the former American Southwest, Graham's strategic loss in 2277 to the New California Republic at the First Battle of Hoover Dam resulted in Caesar punishing him by being set on fire and thrown into the Grand Canyon, only to survive.
When Caesar retaliated by sending the White Legs raiders to raze New Canaan and chase the survivors into Zion Canyon in 2281, Graham rallied the Dead Horses tribe as their war-chief to protect Zion and atone for his bloodstained past. He appears as the central character in the Fallout: New Vegas add-on Honest Hearts and is mentioned in Fallout: New Vegas and its add-on Lonesome Road."""
splitter = SentenceSplitter(text)
splits = splitter()
print(splits)
[
"Joshua Graham, also known as the Burned Man and formerly known as the Malpais Legate, is a Mormon missionary
from the settlement of New Canaan and a co-founder of Caesar's Legion as well as its first legate.",
"After three decades of service in the Legion, helping Caesar carry out his conquest of the former state of
Arizona and much of the former American Southwest, Graham's strategic loss in 2277 to the New California Republic
at the First Battle of Hoover Dam resulted in Caesar punishing him by being set on fire and thrown into the Grand
Canyon, only to survive.",
'When Caesar retaliated by sending the White Legs raiders to raze New Canaan and chase the survivors into Zion
Canyon in 2281, Graham rallied the Dead Horses tribe as their war-chief to protect Zion and atone for his
bloodstained past.',
'He appears as the central character in the Fallout: New Vegas add-on Honest Hearts and is mentioned in
Fallout: New Vegas and its add-on Lonesome Road.'
]
print(chunker(splits))
[
<Chunk size=3, chars=800, tokens=171, splits=["Joshua Graham, also known as the Burned Man and formerly known
as the Malpais Legate, is a Mormon missionary from the settlement of New Canaan and a co-founder of Caesar's Legion
as well as its first legate.", "After three decades of service in the Legion, helping Caesar carry out his conquest
of the former state of Arizona and much of the former American Southwest, Graham's strategic loss in 2277 to the
New California Republic at the First Battle of Hoover Dam resulted in Caesar punishing him by being set on fire and
thrown into the Grand Canyon, only to survive.", "When Caesar retaliated by sending the White Legs raiders to raze
New Canaan and chase the survivors into Zion Canyon in 2281, Graham rallied the Dead Horses tribe as their
war-chief to protect Zion and atone for his bloodstained past."]/>,
<Chunk size=1, chars=151, tokens=33, splits=["He appears as the central character in the Fallout: New Vegas
add-on Honest Hearts and is mentioned in Fallout: New Vegas and its add-on Lonesome Road."]/>
]
The benchmarking was conducted using the framework proposed in the Chunking Evaluation Framework. Part of the results for other splitting algorithms were also taken from this work.
The embedding model used is all-MiniLM-L6-v2, and the number of retrieved chunks was set to 5.
The results for the developed dynamic sliding-window-based semantic chunking method are marked with asterisks.
Splitting algorithm | Parameters | Recall | Precision | Precision Ω | IoU |
---|---|---|---|---|---|
Recursive | size: 250 t.; overlap: 125 t. | 78.7 ± 39.2 | 4.9 ± 4.5 | 21.9 ± 14.9 | 4.9 ± 4.4 |
Token Text | size: 250 t.; overlap: 125 t. | 82.4 ± 36.2 | 3.6 ± 3.1 | 11.4 ± 6.6 | 3.5 ± 3.1 |
Recursive | size: 250 t.; overlap: 0 t. | 78.5 ± 39.5 | 5.4 ± 4.9 | 26.7 ± 18.3 | 5.4 ± 4.9 |
Token Text | size: 250 t.; overlap: 0 t. | 77.1 ± 39.3 | 3.3 ± 3.0 | 16.4 ± 10.3 | 3.3 ± 3.0 |
Recursive | size: 200 t.; overlap: 0 t. | 75.7 ± 40.7 | 6.5 ± 6.2 | 31.2 ± 18.4 | 6.5 ± 6.1 |
Token Text | size: 200 t.; overlap: 0 t. | 76.6 ± 38.8 | 4.1 ± 3.7 | 19.1 ± 11.0 | 4.1 ± 3.6 |
Kamradt Mod. | size: 250 t.; overlap: 0 t. | 63.1 ± 46.9 | 2.7 ± 3.8 | 13.5 ± 13.3 | 2.7 ± 3.8 |
Kamradt Mod. | size: 200 t.; overlap: 0 t. | 67.9 ± 44.9 | 3.5 ± 4.1 | 16.0 ± 14.9 | 3.5 ± 4.1 |
Cluster | size: 250 t.; overlap: 0 t. | 77.3 ± 38.6 | 6.1 ± 5.1 | 28.6 ± 16.7 | 6.0 ± 5.1 |
Cluster | size: 200 t.; overlap: 0 t. | 75.2 ± 39.9 | 7.2 ± 6.1 | 33.6 ± 20.0 | 7.2 ± 6.0 |
Dyn. Window* |
|
74.9 ± 40.0 | 7.6 ± 6.5 | 35.0 ± 18.3 | 7.4 ± 6.3 |
Dyn. Window* |
|
68.7 ± 41.3 | 10.3 ± 8.5 | 48.4 ± 19.8 | 10.0 ± 8.3 |
Initial text is first divided into paragraphs using the regular expression \n+
, then into sentences based on the delimiters: !?.
.
After splitting the text into sentences, the core logic of the algorithm begins.
The algorithm is based on the sliding window technique with a dynamically changing size. It sets the pointer
Then the distance
If the distance
Then the
The algorithm continues until:
where
The chunking logic described above is depicted in Figures 1–2.
Fig. 1. Finding the cosine similarity between
Fig. 2. Formation of document
When expanding the window
Fig. 3. Drift of semantic meaning as the window expands
Adding a new sentence moves the window
Using a high threshold value of
For additional control over the documents being created, the
For more details, check the research paper.
The developed adjustment algorithm is based on the binary search algorithm. The source text is divided into paragraphs
and then into sentences, just like at the start of the Semantic Splitting
algorithm.
Next, based on the specified size of the static window
For each obtained chunk
Next, the distance
A dictionary is formed from the obtained distances and chunks:
The created dictionary
After receiving the sorted dictionary
Fig. 4. The process of finding the threshold
Using binary search-based tuning, the threshold value of the cosine similarity can be adjusted based on the data
to be split. In addition, since the static window size of
For more details, check the research paper.