What is the best approach for DocumentStore if Q&A is just for a single document #5157

Rajmehta123 · 2023-06-16T16:15:13Z

I have a use case where I need Q&A on a single document at a time. There will be 1000 documents in the pipe, each with the same set of questions to extract answers for. I tried the approach below, but it leads to a memory leak.

What is the best way to use InMemoryDocumentStore to serve only one document at a time? I still want to use the PDFConverter module and split the document into passages, but only for one document at a time.
Once the answers to the set of questions have been extracted from that document, it needs to be erased from memory, and the process repeats for the next document.

At present, I use a for loop to iterate over all documents: process each document in memory -> delete all documents from memory -> process the next document in the loop.

`if ques_list is None:
    ques_list = ['Ques1', 'Ques2', 'Ques3']
document_store = InMemoryDocumentStore()

# Wait until other processes finish the document if using Flask service serving multiple threads
time_elapsed = 0
while len(document_store.get_all_documents()) != 0:
    time.sleep(1)
    time_elapsed += 1
document_store.delete_documents()

d = converter.convert(pdf_path)
d = processor.process(d)

document_store.write_documents(d)

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

document_store.update_embeddings(retriever)

pipeline_temp = ExtractiveQAPipeline(reader, retriever)

for question in ques_list:
    prediction = pipeline_temp.run(query=question, params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 4}})
`

ZanSara · 2023-06-19T09:35:37Z

My recommendation would be to use an Indexing Pipeline and then to use metadata filtering to get information from only one document at a time.

Indexing Pipeline: https://docs.haystack.deepset.ai/docs/pipelines#indexing-pipelines
Metadata filtering: https://docs.haystack.deepset.ai/docs/metadata-filtering

You also don't need to recreate the pipeline every time: I think that is what causes the memory leak. Just create it once when you start the Flask server and reuse it.
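A minimal sketch of that suggestion, assuming Haystack 1.x call signatures (`convert(file_path=..., meta=...)`, `update_embeddings(retriever, filters=...)`, `delete_documents(filters=...)`, and a `filters` entry in the Retriever params). It has not been verified against the exact version used in this thread. The function name `answer_questions_for_pdf` and the `doc_id` metadata key are hypothetical, and for brevity it calls the converter and preprocessor directly rather than wrapping them in an indexing pipeline. All components are expected to be created once at server start and passed in.

```python
import uuid


def answer_questions_for_pdf(pdf_path, questions, *, converter, processor,
                             document_store, retriever, pipeline):
    """Index one PDF, ask every question against it alone, then remove it."""
    # Hypothetical metadata key; a fresh UUID per call keeps the passages of
    # concurrent requests apart inside the shared document store.
    doc_id = str(uuid.uuid4())
    only_this_doc = {"doc_id": [doc_id]}

    # Tag every passage of this PDF with the same doc_id.
    docs = converter.convert(file_path=pdf_path, meta={"doc_id": doc_id})
    docs = processor.process(docs)

    try:
        document_store.write_documents(docs)
        # Embed only the passages that were just written.
        document_store.update_embeddings(retriever, filters=only_this_doc)

        params = {
            "Retriever": {"top_k": 3, "filters": only_this_doc},
            "Reader": {"top_k": 4},
        }
        return {q: pipeline.run(query=q, params=params) for q in questions}
    finally:
        # Remove this document's passages even if a query failed.
        document_store.delete_documents(filters=only_this_doc)
```

Because the store, retriever, and `ExtractiveQAPipeline` are reused, nothing model-sized is rebuilt per document, and the filter means a request never has to wait for the store to be empty.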

Rajmehta123 (Author) · 2023-06-20T16:43:57Z

Can I customize the pipeline to accept raw text as input? For example:
p.add_node(component=text_converter, name="TextConverter", inputs=["Text"])

If not, how can I convert raw text input to a Document with a randomly generated UUID as its metadata? The metadata won't be of much use here, as the input only needs to exist until the answer is found. Once it is found, I would delete the Document from the store.
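For the raw-text case, a minimal sketch, assuming Haystack 1.x, where `write_documents` accepts plain dictionaries with `content` and `meta` keys and `Document.from_dict` turns such a dictionary into a `Document`. The helper name `raw_text_to_record` and the `doc_id` key are hypothetical.

```python
import uuid


def raw_text_to_record(text: str) -> dict:
    """Wrap raw text in a dict shaped like a Haystack 1.x document."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    # The random UUID is the only metadata; it identifies this text later.
    return {"content": text, "meta": {"doc_id": str(uuid.uuid4())}}


record = raw_text_to_record("Some raw text to run the questions against.")
# With Haystack installed, this could presumably be followed by:
#   document_store.write_documents([record])
#   ...run the questions with filters={"doc_id": [record["meta"]["doc_id"]]}...
#   document_store.delete_documents(filters={"doc_id": [record["meta"]["doc_id"]]})
```

The UUID is still useful even for short-lived input: it is what lets the query and the later delete target exactly this text in a shared store.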
