How to make PreProcessor explode each input document into many output Documents with a single passage each? #3515

Answered by ZanSara
dkaumanns asked this question in Questions
Hey @kaumanns, those options are indeed a bit mislabeled. I normally recommend using either split_by='sentence', or split_by='word' together with split_respect_sentence_boundary=True. To be honest, I'm not aware of a good use case for split_by='passage'.

As you can see in the source, it should split the documents at each \n\n found, but I believe many file converters do not treat whitespace properly and will collapse such strings into a single \n, which makes split_by='passage' fail to work.
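
To illustrate the failure mode described above, here is a minimal plain-Python sketch. It is not Haystack's actual implementation: split_passages is a hypothetical helper that only mimics the "split at each \n\n" behaviour, and the collapsed string simulates what a converter might produce if it squashes blank lines.

```python
from typing import List


def split_passages(text: str) -> List[str]:
    """Hypothetical helper: split on blank lines ("\n\n") and drop empty chunks."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


# Text with proper passage separators.
original = "First passage.\n\nSecond passage.\n\nThird passage."

# Assumption: a converter that collapses each "\n\n" into a single "\n".
collapsed = original.replace("\n\n", "\n")

print(split_passages(original))   # three separate passages
print(split_passages(collapsed))  # one chunk, because no "\n\n" is left
```

If the converted text looks like the collapsed variant, passage splitting has nothing to split on, so checking the converter's output for "\n\n" first is a quick way to confirm whether this is the cause.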

There's a related discussion/issue here: #3464 #3498

Answer selected by dkaumanns