How to make PreProcessor explode each input document into many output `Document`s with a single passage each? #3515

dkaumanns · 2022-11-02T09:14:25Z

I want PreProcessor to split by passage (defined as \n\n-delimited) with this config:

  - name: Preprocessor
    type: PreProcessor
    params:  
      split_by: passage  
      split_length: 1

Per documentation, I expect this config to explode each input document into many passages, each passage yielding its own output Document for DocumentStore:

1 input document -> many output Documents with 1 passage each

But it doesn't work.

When I run this over a single input document with several (confirmed) passages, it produces just a single Document with a single passage:

1 input document -> 1 output Document with 1 passage

When I try this:

  - name: Preprocessor
    type: PreProcessor
    params:  
      split_by: passage  
      split_length: 1000

... it produces, again, a single Document, but this time with many (all) passages:

1 input document -> 1 output Document with many passages

This behaviour seems incongruent with the official docs:

I got the expected behaviour when I used convert_files_to_docs() with the split_paragraphs=True option before running PreProcessor (as documented here: https://haystack.deepset.ai/tutorials/08_preprocessing).

If that is the way to do it, I don't understand the point of the PreProcessor options.

What is the proper way? Do I have to write my own custom node to pre-split the incoming documents?

Answered by ZanSara

ZanSara · 2022-11-02T11:52:43Z

Hey @kaumanns, those options are indeed a bit mislabeled. I normally recommend using either split_by='sentence' or split_by='word' with split_respect_sentence_boundary=True. To be honest, I'm not really aware of the use case for split_by='passage'.

As you can see in the source, it should split the documents at each \n\n found, but I believe many file converters do not treat whitespace properly and will collapse such strings into a single \n, making split_by='passage' fail to work.

There's a related discussion/issue here: #3464 #3498
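
To illustrate the point about collapsed whitespace, here is a minimal plain-Python sketch of splitting on blank lines. It is not Haystack's actual implementation; `split_passages` is a hypothetical helper, and the "converter" is simulated with a simple string replacement.

```python
from typing import List


def split_passages(text: str) -> List[str]:
    """Split text on blank lines and drop empty pieces (hypothetical helper)."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


intact = "First passage.\n\nSecond passage.\n\nThird passage."
# Assumption: simulates a converter that collapses each blank line into one newline.
collapsed = intact.replace("\n\n", "\n")

intact_passages = split_passages(intact)
collapsed_passages = split_passages(collapsed)

print(len(intact_passages))     # 3
print(len(collapsed_passages))  # 1
```

With the blank lines gone, there is no delimiter left to split on, so the whole text comes back as a single passage.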

2 replies

dkaumanns Nov 2, 2022
Author

Thank you for the answer. I take it that I should pre-split with my own logic.

FYI: the converter is not the issue, at least in this case. I feed in plain text files.

ZanSara Nov 3, 2022

Alright! If you feel like it, I would be happy to see which algorithm you'll be using for splitting. We'd like to keep improving the PreProcessor 😊

bogdankostic · 2022-11-02T18:36:25Z

Hey @kaumanns,
I wonder why you're not getting an error when trying to split your Document with the given configuration. When setting split_by to "passage", you need to explicitly set split_respect_sentence_boundary to False; otherwise we raise NotImplementedError("'split_respect_sentence_boundary=True' is only compatible with split_by='word'.").

Executing the following code snippet works as expected for me:

from haystack.nodes import PreProcessor
from haystack import Document

TEXT = """
This is a sample sentence in paragraph_1. This is a sample sentence in paragraph_1. This is a sample sentence in
paragraph_1. This is a sample sentence in paragraph_1. This is a sample sentence in paragraph_1.

This is a sample sentence in paragraph_2. This is a sample sentence in paragraph_2. This is a sample sentence in
paragraph_2. This is a sample sentence in paragraph_2. This is a sample sentence in paragraph_2.

This is a sample sentence in paragraph_3. This is a sample sentence in paragraph_3. This is a sample sentence in
paragraph_3. This is a sample sentence in paragraph_3. This is to trick the test with using an abbreviation like Dr.
in the sentence.
"""

single_document = Document(content=TEXT)
preprocessor = PreProcessor(split_by="passage", split_length=1, split_respect_sentence_boundary=False)
split_documents = preprocessor.process(single_document)
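
For intuition about how split_length interacts with split_by="passage", the following plain-Python sketch groups passages into chunks of at most split_length units. It is an approximation for illustration only, not the PreProcessor source: `chunk_passages` is a hypothetical helper, overlap is ignored, and joining the passages with a blank line is an assumption.

```python
from typing import List


def chunk_passages(text: str, split_length: int) -> List[str]:
    """Group blank-line-delimited passages into chunks of up to split_length."""
    if split_length < 1:
        raise ValueError("split_length must be at least 1")
    passages = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        "\n\n".join(passages[i : i + split_length])  # joiner is an assumption
        for i in range(0, len(passages), split_length)
    ]


sample = "Paragraph 1.\n\nParagraph 2.\n\nParagraph 3."

one_per_chunk = chunk_passages(sample, split_length=1)
all_in_one = chunk_passages(sample, split_length=1000)

print(len(one_per_chunk))  # 3
print(len(all_in_one))     # 1
```

Under these assumptions, split_length=1 yields one output per passage, while a large value such as 1000 packs every passage into a single output.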

6 replies

bogdankostic Nov 3, 2022

This is indeed quite unexpected behavior. Could you maybe show your whole config file? Also, could you try to run just the PreProcessor and look at what kind of Documents are returned there? This would help us narrow the problem down to either the PreProcessor or the OpenSearchDocumentStore.

dkaumanns Nov 3, 2022
Author

My full pipeline config. I will run the experiment to get the data point you requested.

All this is running from a fork synced with current latest fc551b9

# To allow your IDE to autocomplete and validate your YAML pipelines, name them as <name of your choice>.haystack-pipeline.yml

version: ignore

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: OpenSearchDocumentStore
    params:
      host: localhost
      scheme: http
      verify_certs: false
      analyzer: german
      username: admin
      password: admin
      index: document
      label_index: label
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore    # params can reference other components defined in the YAML
      top_k: 20
  - name: Reader       # custom-name for the component; helpful for visualization & debugging
    type: FARMReader    # Haystack Class name for the component
    params:
      model_name_or_path: deepset/gelectra-large-germanquad
      context_window_size: 500
      return_no_answer: true
      use_gpu: true
  - name: TextFileConverter
    type: TextConverter
  - name: PDFFileConverter
    type: PDFToTextConverter
  - name: Preprocessor
    type: PreProcessor
    params:  # https://docs.haystack.deepset.ai/reference/preprocessor-api
      split_by: word  # Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
      split_length: 200  # Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by -> "sentence", then each output document will have 10 sentences.
      split_overlap: 20  # Sets the amount of overlap between two adjacent documents after a split. Setting this to a positive number essentially enables the sliding window approach.
      split_respect_sentence_boundary: true  # Whether to split in partial sentences if split_by -> word. If set to True, the individual split will always have complete sentences & the number of words will be <= split_length.
#      split_by: word  # Unit for splitting the document. Can be "word", "sentence", or "passage". Set to None to disable splitting.
#      split_length: 1000  # Max. number of the above split unit (e.g. words) that are allowed in one document. For instance, if n -> 10 & split_by -> "sentence", then each output document will have 10 sentences.
#      split_respect_sentence_boundary: true  # Whether to split in partial sentences if split_by -> word. If set to True, the individual split will always have complete sentences & the number of words will be <= split_length.
      language: de  # The language used by "nltk.tokenize.sent_tokenize" in iso639 format.
#      add_page_number: true  # Add the number of the page a paragraph occurs in to the Document's meta field "page".
      clean_empty_lines: true  # Remove more than two empty lines in the text.
      clean_whitespace: true  # Strip whitespaces before or after each line in the text.
  - name: FileTypeClassifier
    type: FileTypeClassifier
  - name: LanguageDetector
    type: LanguageDetector

pipelines:
  - name: query    # a sample extractive-qa Pipeline
    nodes:
      - name: LanguageDetector
        inputs: [Query]
      - name: Retriever
        inputs: [LanguageDetector]
      - name: Reader
        inputs: [Retriever]
  - name: indexing
    nodes:
      - name: FileTypeClassifier
        inputs: [File]
      - name: TextFileConverter
        inputs: [FileTypeClassifier.output_1]
      - name: PDFFileConverter
        inputs: [FileTypeClassifier.output_2]
      - name: Preprocessor
        inputs: [PDFFileConverter, TextFileConverter]
      - name: LanguageDetector
        inputs: [Preprocessor]
      - name: Retriever
        inputs: [LanguageDetector]
      - name: DocumentStore
        inputs: [Retriever]

dkaumanns Nov 3, 2022
Author

@bogdankostic

Correction: I rewound to v1.10.0 (I assume that's the stable release). The weirdness persists.

I extended https://github.com/deepset-ai/haystack/blob/v1.10.0/haystack/nodes/preprocessor/base.py#L59 with logs about the document numbers in input and output:

    def run(  # type: ignore
        self,
        documents: Union[dict, Document, List[Union[dict, Document]]],
        clean_whitespace: Optional[bool] = None,
        clean_header_footer: Optional[bool] = None,
        clean_empty_lines: Optional[bool] = None,
        split_by: Literal["word", "sentence", "passage", None] = None,
        split_length: Optional[int] = None,
        split_overlap: Optional[int] = None,
        split_respect_sentence_boundary: Optional[bool] = None,
        id_hash_keys: Optional[List[str]] = None,
    ):
        processed_documents = self.process(
            documents=documents,
            clean_whitespace=clean_whitespace,
            clean_header_footer=clean_header_footer,
            clean_empty_lines=clean_empty_lines,
            split_by=split_by,
            split_length=split_length,
            split_overlap=split_overlap,
            split_respect_sentence_boundary=split_respect_sentence_boundary,
            id_hash_keys=id_hash_keys,
        )
        result = {"documents": processed_documents}
        logger.info("Number of input documents: " + str(len(documents)))
        logger.info("Number of output documents: " + str(len(processed_documents)))
        return result, "output_1"

I fed in a single document. Log:

haystack-api-1  | INFO:haystack:Number of input documents: 1
haystack-api-1  | INFO:haystack:Number of output documents: 178

So it looks like PreProcessor is working.

Now to check the index via Opensearch Dashboards:

GET /document/_count

Output:

{
  "count" : 1,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

All output docs but one are gone.

I verified that the index was purged prior to this experiment.

I re-ran this experiment with 2 input documents. The results are congruent: OpenSearch receives 2 docs where it should receive about 300 docs.

bogdankostic Nov 3, 2022

Thanks for doing the analysis! Good to know that the PreProcessor is working as intended.

My first guess as to why you end up with only one Document in your OpenSearchDocumentStore is that all of the split Documents end up with the same ID. Document IDs inside a DocumentStore need to be unique; if there are duplicate IDs, the earlier Documents will be overwritten. Can you maybe check whether the resulting IDs are unique?
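
One way to check this is sketched below in plain Python. The toy `store` dictionary only stands in for a document store keyed by ID, and `find_duplicate_ids` is a hypothetical helper; neither is part of Haystack.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_duplicate_ids(ids: List[str]) -> Dict[str, int]:
    """Return every ID that occurs more than once, with its count."""
    counts = Counter(ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Toy store keyed by ID: writing an existing ID replaces the earlier entry.
store: Dict[str, str] = {}
splits: List[Tuple[str, str]] = [
    ("abc", "passage 1"),
    ("abc", "passage 2"),
    ("abc", "passage 3"),
]
for doc_id, content in splits:
    store[doc_id] = content

duplicates = find_duplicate_ids([doc_id for doc_id, _ in splits])

print(len(store))   # 1
print(duplicates)   # {'abc': 3}
```

Three splits go in, but only the last one survives because they all share one key.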

dkaumanns Nov 3, 2022
Author

Solved. Thanks @bogdankostic

Turns out, the IDs were not unique. Why? The metadata of the input documents carried an _id field (from the source MongoDB). Each _id value was propagated to all splits of the respective input document, causing OpenSearch to silently overwrite existing entries.

The API reference mentions an ominous self.update_existing_documents=True option to prevent this behaviour (?). But there is no obvious way to control that option from the pipeline yaml interface.

I removed the _id field from the metadata and all is good.
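
For anyone hitting the same problem, a minimal sketch of that cleanup step follows. It assumes the documents are plain dictionaries with "content" and "meta" keys; `drop_meta_key` and the sample values are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict, List


def drop_meta_key(docs: List[Dict[str, Any]], key: str = "_id") -> List[Dict[str, Any]]:
    """Return copies of the documents with one key removed from their meta."""
    cleaned = []
    for doc in docs:
        meta = {k: v for k, v in doc.get("meta", {}).items() if k != key}
        cleaned.append({**doc, "meta": meta})
    return cleaned


# Hypothetical input: a document exported from MongoDB with its _id in the meta.
raw_docs = [{"content": "Some text.", "meta": {"_id": "mongo-123", "source": "mongo"}}]
cleaned_docs = drop_meta_key(raw_docs)

print(cleaned_docs[0]["meta"])  # {'source': 'mongo'}
```

The helper builds new dictionaries instead of mutating the input, so the original records stay intact. If the source ID is still needed, it could be kept under a different key that does not collide with the store's ID field.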
