The logic regarding separator retention in RecursiveCharacterTextSplitter partitioning can be problematic #26311
Replies: 2 comments
-
Hey @xiuxiuxius! I'm here to help you with your bug, answer questions, and even guide you on becoming a contributor. Let's dive into the issue you're facing!

To address the issue with separator retention in the …

```python
def _split_text(self, text: str, separators: List[str]) -> List[str]:
    """Split incoming text and return chunks."""
    final_chunks = []
    # Get appropriate separator to use
    separator = separators[-1]
    new_separators = []
    for i, _s in enumerate(separators):
        _separator = _s if self._is_separator_regex else re.escape(_s)
        if _s == "":
            separator = _s
            break
        if re.search(_separator, text):
            separator = _s
            new_separators = separators[i + 1 :]
            break

    _separator = separator if self._is_separator_regex else re.escape(separator)
    splits = _split_text_with_regex(text, _separator, self._keep_separator)

    # Now go merging things, recursively splitting longer texts.
    _good_splits = []
    _separator = separator if self._keep_separator else ""
    for s in splits:
        if self._length_function(s) < self._chunk_size:
            _good_splits.append(s)
        else:
            if _good_splits:
                merged_text = self._merge_splits(_good_splits, _separator)
                final_chunks.extend(merged_text)
                _good_splits = []
            if not new_separators:
                final_chunks.append(s)
            else:
                other_info = self._split_text(s, new_separators)
                final_chunks.extend(other_info)
    if _good_splits:
        merged_text = self._merge_splits(_good_splits, _separator)
        final_chunks.extend(merged_text)
    return final_chunks
```

This change ensures that …
-
Oh, I see. In `_split_text_with_regex`, if `self._keep_separator` is `True`, the separator is already retained inside the splits themselves, so it does not need to be added again when the splits are joined.
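To make that observation concrete, here is a minimal sketch of a separator-keeping split. `split_keeping_separator` is a hypothetical stand-in written for this illustration, not the library's `_split_text_with_regex`, and it assumes a non-empty literal separator.

```python
import re
from typing import List


def split_keeping_separator(text: str, separator: str) -> List[str]:
    """Hypothetical helper: split on a literal, non-empty separator and
    keep each separator at the start of the piece that follows it."""
    # A capturing group makes re.split return the separators as well:
    # [piece0, sep, piece1, sep, piece2, ...]
    parts = re.split(f"({re.escape(separator)})", text)
    pieces = [parts[0]]
    # Glue each separator (odd index) onto the piece right after it.
    for i in range(1, len(parts), 2):
        pieces.append(parts[i] + parts[i + 1])
    # Drop empty pieces, e.g. when the text starts with the separator.
    return [p for p in pieces if p != ""]


text = "alpha\n\nbeta\n\ngamma"
pieces = split_keeping_separator(text, "\n\n")
print(pieces)
# Joining with the empty string restores the original text,
# because every separator already lives inside a piece.
print("".join(pieces) == text)
```

Under these assumptions the separator survives the split step, which is why an empty join string is enough afterwards.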
-
Checked other resources
Commit to Help
Example Code
Description
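I think this line in the code is problematic: `_separator = "" if self._keep_separator else separator`.
If the separator needs to be retained, then `_separator` should equal the current separator. So the logic should be reversed: `_separator = separator if self._keep_separator else ""`.
This preserves the current separator when `self._keep_separator` is set to `True`.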
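One way to check this proposal is to compare the two join rules on pieces that already carry their separator. The sketch below is only an illustration: `join_pieces` and its `reversed_logic` flag are hypothetical names, the sample pieces are assumed input, and it assumes the merge step simply joins the pieces with the chosen string.

```python
from typing import List

# Assumed input: pieces as they might look after a split that keeps
# the separator at the start of each following piece.
pieces: List[str] = ["alpha", "\n\nbeta", "\n\ngamma"]
separator = "\n\n"


def join_pieces(
    pieces: List[str], separator: str, keep_separator: bool, reversed_logic: bool
) -> str:
    """Hypothetical helper comparing the two ways of choosing the join string."""
    if reversed_logic:
        # Rule proposed in this post.
        joiner = separator if keep_separator else ""
    else:
        # Rule with the opposite condition.
        joiner = "" if keep_separator else separator
    return joiner.join(pieces)


current = join_pieces(pieces, separator, keep_separator=True, reversed_logic=False)
proposed = join_pieces(pieces, separator, keep_separator=True, reversed_logic=True)

print(repr(current))
print(repr(proposed))
```

If the pieces really do contain the separator already, the reversed rule inserts it a second time at every join, so the outcome depends on what the split step returns.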
System Info
System Information
Package Information
Packages not installed (Not Necessarily a Problem)
The following packages were not found: