Need help reviewing my configuration #2889

felixriehm · 2022-07-27T09:04:33Z

felixriehm
Jul 27, 2022

Hey everyone,

I have built a chatbot with Haystack by following the tutorials on the website, and I now have some questions about the correctness of my setup.

I'm using the REST API with a pipeline YAML file. There, I defined a retriever of type BM25Retriever and a reader of type FARMReader, which uses deepset/roberta-base-squad2 as its model. The pipeline is simple: query -> retriever -> reader. Now I wonder whether this configuration works with the way I have written data to Elasticsearch.

I have written a Python script to upload documents to Elasticsearch. The script is divided into two sections. The first section writes general documents to Elasticsearch with

document_store = ElasticsearchDocumentStore(host="localhost", port=9200, username="", password="", index="document", similarity="dot_product")
document_store.write_documents(docs)

After that I update the embeddings with document_store.update_embeddings(retriever). As the retriever I use DensePassageRetriever with

query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",

In the second section I write FAQ data to Elasticsearch, following this tutorial.

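import sys

import pandas as pd
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import EmbeddingRetriever

# Upload FAQ docs to Elasticsearch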
document_store_FAQ = ElasticsearchDocumentStore(host="localhost", username="", password="",
                                            index="faq",
                                            similarity="cosine",
                                            embedding_field="question_emb",
                                            embedding_dim=384,
                                            excluded_meta_data=["question_emb"])
print(document_store_FAQ.get_all_documents())
document_store_FAQ.delete_documents()

retriever = EmbeddingRetriever(document_store=document_store_FAQ, embedding_model="sentence-transformers/all-MiniLM-L6-v2", use_gpu=False)

df = pd.read_csv(str(sys.argv[2]))
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())

# Get embeddings for our questions from the FAQs
questions = list(df["question"].values)
df["question_emb"] = retriever.embed_queries(texts=questions)
df = df.rename(columns={"question": "content"})

# Convert Dataframe to list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store_FAQ.write_documents(docs_to_index)
#document_store_FAQ.update_embeddings(retriever)
print(document_store_FAQ.get_all_documents())

The reasons I am confused by this setup:

1. When I'm writing the general documents to Elasticsearch I'm using DensePassageRetriever to embed the data, but when I do a query I use BM25Retriever (which is sparse). Is this intended to work correctly or are the embedding values completely different and don't work together?
2. My intention with separating general documents and FAQ data is to improve the results. However, I'm not sure whether this has any effect. I use the sentence-transformers/all-MiniLM-L6-v2 EmbeddingRetriever when I write FAQ data to Elasticsearch, but at query time the pipeline uses the deepset/roberta-base-squad2 FARMReader (is this reader suited for FAQ data?). In addition, I retrieve the data with BM25Retriever which, as in my first question, is not the retriever I used to write the data to Elasticsearch. So there can't be a positive effect? Do I have to use two retrievers and two readers (one for general documents and one for FAQ data) and then merge the results? And what is the purpose of using two different indices for writing to Elasticsearch (in my case 'faq' and 'document')? Can't I just write all documents to Elasticsearch with the same index?

Then there are other general questions that I have:

1. When I do a query, the response has an 'answers' and a 'documents' attribute. How can I control the output so that it only shows 'answers' or 'documents'?
2. When I do a query, the scores of 'documents' are usually higher than the 'answers' scores. Is that normal? If yes, why? Are the scores of 'documents' and 'answers' comparable?

Best,
Felix

Answered by sjrl

sjrl · 2022-07-27T09:34:33Z

sjrl
Jul 27, 2022
Maintainer

To answer your first question

When I'm writing the general documents to Elasticsearch I'm using DensePassageRetriever to embed the data, but when I do a query I use BM25Retriever (which is sparse). Is this intended to work correctly or are the embedding values completely different and don't work together?

The BM25Retriever does not use vector embeddings when retrieving documents. It is a sparse, keyword-based method that builds on TF-IDF (more details can be found here). So, in short, this combination is intended to work, but you are currently not using the embeddings for document retrieval. If you never plan on using a DensePassageRetriever in your query pipeline, then you do not need to run document_store.update_embeddings() after writing your documents to the document store. If you would like to use DensePassageRetrieval in your query pipeline then you should replace the BM25Retriever with the DensePassageRetrieval.
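
To illustrate why sparse retrieval works without any stored embeddings, here is a minimal pure-Python sketch of a textbook BM25 scorer. It is not Haystack's or Elasticsearch's implementation; the function name bm25_scores, the sample documents, and the k1 and b values are assumptions chosen for illustration. The score depends only on term counts and document lengths.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no lexical overlap, so no contribution
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Illustrative documents (not from the thread).
docs = [
    "haystack uses elasticsearch as a document store",
    "the reader extracts answers from retrieved documents",
    "bananas are rich in potassium",
]
scores = bm25_scores("elasticsearch document store", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only the first document shares exact terms with the query, so it is the only one with a non-zero score. No embedding vector is read at any point, which is why embeddings written by a dense retriever are simply ignored by a BM25 query.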

1 reply

felixriehm Jul 28, 2022
Author

If you would like to use DensePassageRetrieval in your query pipeline then you should replace the BM25Retriever with the DensePassageRetrieval.

It should be DensePassageRetriever not DensePassageRetrieval.

sjrl · 2022-07-27T09:46:28Z

sjrl
Jul 27, 2022
Maintainer

My intention with separating general documents and FAQ data is to improve the results.

I'm not entirely sure what you mean by improving the results in this case. A reason I could see for having two separate indices for general data and FAQ data is if you have two separate query pipelines for general data and FAQ data.

Do I have to use two retrievers and two readers (one for general documents and one for FAQ data) and then merge the results? And what is the purpose of using two different indices for writing to Elasticsearch (in my case 'faq' and 'document')? Can't I just write all documents to Elasticsearch with the same index?

No, you do not have to use two separate retrievers and readers. Yes, you could write both the general documents and FAQ data to the same document store using the same index. If you do this then your query pipeline will return answers that could either be from the general data or the FAQ data.

8 replies

sjrl Jul 27, 2022
Maintainer

Hi @felixriehm, you can inspect ready-made pipelines. Check out the docs at https://haystack.deepset.ai/components/pipelines (specifically the section under the header "Inspect a Pipeline").

sjrl Jul 27, 2022
Maintainer

So you are saying that at the moment only the general documents are being searched? I guess I have to pass an argument to the query in order to search index=faq?

Not exactly. You decide which index to search when you create your document_store(index=CHOOSE_YOUR_INDEX). So if your FAQ documents and general documents are stored in separate indices, you will need to create two separate document stores, which can then be put into two separate pipelines.
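
The point about one document store per index can be sketched with a toy model. The ToyDocumentStore class below is hypothetical and only imitates the idea of a store bound to a single index; it is not the Haystack API, and the sample documents are made up.

```python
class ToyDocumentStore:
    """Hypothetical stand-in for a document store bound to one index."""

    # Shared "cluster": index name -> list of documents.
    _cluster = {}

    def __init__(self, index):
        self.index = index
        self._cluster.setdefault(index, [])

    def write_documents(self, docs):
        self._cluster[self.index].extend(docs)

    def search(self, term):
        # Naive keyword match, restricted to this store's own index.
        return [d for d in self._cluster[self.index] if term.lower() in d.lower()]


general_store = ToyDocumentStore(index="document")
faq_store = ToyDocumentStore(index="faq")

general_store.write_documents(["Haystack pipelines connect retrievers and readers."])
faq_store.write_documents(["How do I reset my password?"])

# One "pipeline" per store, mirroring the two-pipeline setup described above.
pipelines = {
    "general": general_store.search,
    "faq": faq_store.search,
}
print(pipelines["general"]("password"))  # []
print(pipelines["faq"]("password"))      # ['How do I reset my password?']
```

Both stores talk to the same toy cluster, but each one only ever sees its own index, so a query sent to the general pipeline cannot return FAQ entries.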

felixriehm Jul 27, 2022
Author

Not exactly. You decide which index to search when you create your document_store(index=CHOOSE_YOUR_INDEX). So if your FAQ documents and general documents are stored in separate indices, you will need to create two separate document stores, which can then be put into two separate pipelines.

Ok, and how would I do that with the REST API pipeline YAML? With a Python script I would define a pipeline with a retriever that is associated with an Elasticsearch instance. I use the default YAML file from the Haystack repo.

My document stores:

document_store = ElasticsearchDocumentStore(host="localhost", port=9200, username="", password="", index="document", similarity="dot_product")
document_store_FAQ = ElasticsearchDocumentStore(host="localhost", username="", password="", index="faq", similarity="cosine",  embedding_field="question_emb", embedding_dim=384, excluded_meta_data=["question_emb"])

Do I have to add an index parameter there like this in the YAML file?

components:    # define all the building-blocks for Pipeline
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost
      index: faq
  ...

At the moment I haven't specified an index there. I assume it defaults to some value? I have just checked. I cannot find any FAQ data, but the general data can be found.

sjrl Jul 28, 2022
Maintainer

Do I have to add an index parameter there like this in the YAML file?

Yes, I believe that is what you need to do.

I assume it defaults to some value?

Yes, the index defaults to the value "document" if you do not specify it.
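
A small sketch of that fallback, using plain dicts in place of the parsed YAML components. The helper resolve_index is hypothetical and only mimics the described behaviour; the one fact taken from this thread is that the default index name is "document".

```python
DEFAULT_INDEX = "document"  # default index name stated in the reply above


def resolve_index(component):
    """Return the index a document store component would end up using."""
    params = component.get("params") or {}
    return params.get("index", DEFAULT_INDEX)


# Parsed form of a `components` entry from a pipeline YAML, as plain dicts.
without_index = {
    "name": "DocumentStore",
    "type": "ElasticsearchDocumentStore",
    "params": {"host": "localhost"},
}
with_index = {
    "name": "DocumentStore",
    "type": "ElasticsearchDocumentStore",
    "params": {"host": "localhost", "index": "faq"},
}

print(resolve_index(without_index))  # document
print(resolve_index(with_index))     # faq
```

This matches what was observed in the thread: with no index set, only the general documents (stored under "document") are searched, and the FAQ data only becomes visible once index: faq is set explicitly.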

I have just checked. I can not find any FAQ data but the general data can be found.

To clarify, you could not find any FAQ data after you added the index: faq to your yml file?

felixriehm Jul 28, 2022
Author

To clarify, you could not find any FAQ data after you added the index: faq to your yml file?

I tried and it works! 👍
