PreProcessor shall split by passages (defined as

```yaml
- name: Preprocessor
  type: PreProcessor
  params:
    split_by: passage
    split_length: 1
```

Per documentation, I expect this config to explode each input document into many passages, each passage yielding its own output:

```
1 input document -> many output
```

But it doesn't work. When I run this over a single input document with several (confirmed) passages, it produces just a single output:

```
1 input document -> 1 output
```

When I try this:

```yaml
- name: Preprocessor
  type: PreProcessor
  params:
    split_by: passage
    split_length: 1000
```

... it produces, again, a single output:

```
1 input document -> 1 output
```

This behaviour seems incongruent with the official docs:
I got the expected behaviour when I used If that is the way to do it, I don't understand the point of the What is the proper way? Do I have to write my own custom node to pre-split the incoming documents?
Replies: 2 comments 8 replies
Hey @kaumanns,

Executing the following code snippet works as expected for me:

```python
from haystack import Document
from haystack.nodes import PreProcessor

# Paragraphs are separated by blank lines ("\n\n"), the separator used by split_by="passage".
TEXT = """
This is a sample sentence in paragraph_1. This is a sample sentence in paragraph_1. This is a sample sentence in
paragraph_1. This is a sample sentence in paragraph_1. This is a sample sentence in paragraph_1.

This is a sample sentence in paragraph_2. This is a sample sentence in paragraph_2. This is a sample sentence in
paragraph_2. This is a sample sentence in paragraph_2. This is a sample sentence in paragraph_2.

This is a sample sentence in paragraph_3. This is a sample sentence in paragraph_3. This is a sample sentence in
paragraph_3. This is a sample sentence in paragraph_3. This is to trick the test with using an abbreviation like Dr.
in the sentence.
"""

single_document = Document(content=TEXT)
preprocessor = PreProcessor(split_by="passage", split_length=1, split_respect_sentence_boundary=False)
split_documents = preprocessor.process(single_document)
```
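Before blaming the node, it may be worth checking that the document content really contains blank-line separators. The following is only a minimal sketch in plain Python, not part of Haystack: `count_passage_separators` is a hypothetical helper, and the sample string stands in for your own document content.

```python
def count_passage_separators(text: str) -> int:
    """Hypothetical helper: count blank-line ("\\n\\n") separators in the text."""
    # Strip first so leading/trailing newlines are not counted as separators.
    return text.strip().count("\n\n")


# Stand-in for the content of a converted document (assumption for illustration).
sample = "Paragraph one.\n\nParagraph two.\n\nParagraph three.\n"

separators = count_passage_separators(sample)
# With N separators you would expect N + 1 passages after splitting.
expected_passages = separators + 1
print(separators, expected_passages)
```

If the count is 0 for a document you believe has several passages, the separators were probably lost before the text reached the `PreProcessor`.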
Hey @kaumanns, those options are indeed a bit mislabeled. I normally recommend using either `split_by='sentence'` or `split_by='word'` with `split_respect_sentence_boundary=True`. To be honest, I'm not really aware of the use case for `split_by='passage'`.

As you can see in the source, it should split the documents at each `\n\n` found, but I believe many file converters do not treat whitespace properly and will collapse such strings into a single `\n`, making `split_by='passage'` fail to work.

There's a related discussion/issue here: #3464 #3498
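To illustrate the failure mode described above, here is a small plain-Python sketch. It is not Haystack's implementation: `split_passages` is a hypothetical helper that simply splits on `\n\n`, and the `replace` call only simulates a converter that collapses blank lines.

```python
from typing import List


def split_passages(text: str) -> List[str]:
    """Hypothetical helper: split on blank lines and drop empty chunks."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


with_blank_lines = (
    "First passage, line 1.\nFirst passage, line 2.\n\n"
    "Second passage.\n\n"
    "Third passage."
)
# Simulate a converter that collapses "\n\n" into a single "\n" (assumption).
collapsed = with_blank_lines.replace("\n\n", "\n")

passages_ok = split_passages(with_blank_lines)
passages_collapsed = split_passages(collapsed)

print(len(passages_ok))         # 3
print(len(passages_collapsed))  # 1
```

Once the blank lines are gone, a split on `\n\n` has nothing to match, so the whole text comes back as a single chunk. That matches the "1 input document -> 1 output" result reported in the question.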