MostSimilarDocuments Pipeline embeddings do not match DocumentSearchPipeline embeddings with same text #3298

mwade-noetic · 2022-09-29T18:19:19Z

mwade-noetic
Sep 29, 2022

I am using the MostSimilarDocumentsPipeline (MSD) as well as the DocumentSearchPipeline. With MSD, we pass in the ID of the document we want to find similar documents for. However, if I take that same document's text and pass it as the query to the DocumentSearchPipeline, I get somewhat different results. Looking closer, I realized that the embedding created for the query text does not match the embedding that was calculated for the same text in MSD.

I am using the sentence-transformers/all-mpnet-base-v2 language model and I have uploaded the documents into an ElasticsearchDocumentStore. Here is the code I use to retrieve the most similar documents to one document in the index. The results are excellent and I get a score of 1.0 when the text is 100% the same (itself and a few duplicates):

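from haystack.document_stores import ElasticsearchDocumentStore
from haystack.pipelines import MostSimilarDocumentsPipeline

# `configs` is assumed to be a dict of connection settings defined elsewhere.
document_store = ElasticsearchDocumentStore(host=configs['url'], port=configs['port'],
                                            username=configs['user'],
                                            password=configs['secret'],
                                            index="haystack_all-mpnet",
                                            embedding_field="embedding",
                                            similarity='cosine',
                                            ca_certs=configs['ca_certs'],
                                            verify_certs=False,
                                            scheme='https')

# Look up the documents most similar to one document that is already in the index.
msd_pipeline = MostSimilarDocumentsPipeline(document_store=document_store)
mpnet_result = msd_pipeline.run(document_ids=['b8446f55abd4ed7625ebb8b294236c64'], top_k=50)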

For the document search pipeline, I run the following code immediately after the code above:

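from haystack.nodes import EmbeddingRetriever
from haystack.pipelines import DocumentSearchPipeline

# Assumed value: the model named earlier in this post.
embedding_model = "sentence-transformers/all-mpnet-base-v2"
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model=embedding_model,
                               batch_size=256,
                               max_seq_len=256,
                               top_k=100,
                               use_gpu=True)

doc_search_pipeline = DocumentSearchPipeline(retriever=retriever)
# Take the text of the very first document returned by the most-similar pipeline and use it as the query.
results = doc_search_pipeline.run(query=mpnet_result[0][0].content, debug=True)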

The results are similar but not the same. The document that the text comes from gets a score of 0.96 instead of 1.0.
Is there a reason for this difference? NOTE: The text itself is only about 80 characters long, so the embedding retriever settings should not be the cause.
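
One way to confirm that the gap comes from the vectors themselves, rather than from ranking, is to compare the two embeddings directly. The sketch below is a minimal pure-Python cosine similarity. The vectors are made-up placeholders, not real model output, and `cosine_similarity` is a helper name introduced here for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Placeholder vectors (illustrative only, not real embeddings).
stored_doc_embedding = [0.12, -0.40, 0.88, 0.05]
query_embedding = [0.10, -0.35, 0.90, 0.15]

same = cosine_similarity(stored_doc_embedding, stored_doc_embedding)
different = cosine_similarity(stored_doc_embedding, query_embedding)
print(f"identical vectors: {same:.4f}")
print(f"different vectors: {different:.4f}")
```

If the stored document embedding and the query embedding were truly identical, the first case would apply; any score below that means the two vectors differ.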

Any insight as to why these results are different would be appreciated.

Thanks

Answered by julian-risch

julian-risch · 2022-10-04T11:26:56Z

julian-risch
Oct 4, 2022
Maintainer

Hi @mwade-noetic, one explanation for the difference you see in the generated embeddings is that a document's metadata, in particular its title (name), is taken into account when the document embedding is generated (and therefore also in your MostSimilarDocumentsPipeline):

haystack/haystack/nodes/retriever/_embedding_encoder.py

Line 188 in 2298155

    
passages = [[d.meta["name"] if d.meta and "name" in d.meta else "", d.content] for d in docs]  # type: ignore

This step is not applied to your query text, though, when you run the following in your DocumentSearchPipeline:
doc_search_pipeline.run(query=mpnet_result[0][0].content, debug=True)

Even if there is no document name, the input used to create the document's embedding is something like
["", d.content]
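
The asymmetry can be illustrated without loading a model. The sketch below only builds the inputs that would be handed to the encoder on each side. `Doc`, `document_side_inputs`, and `query_side_inputs` are hypothetical names introduced for illustration, and the plain-string query path is an assumption based on the description above. Only the list comprehension is taken from the quoted line, and how the model joins the two elements of a pair is not shown here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (content plus metadata)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def document_side_inputs(docs: List[Doc]) -> List[List[str]]:
    # Mirrors the quoted line: each document becomes a [name, content] pair,
    # with an empty string when the metadata has no "name" entry.
    return [[d.meta["name"] if d.meta and "name" in d.meta else "", d.content] for d in docs]


def query_side_inputs(queries: List[str]) -> List[str]:
    # Assumption: the query is passed to the model as plain text, not as a pair.
    return list(queries)


text = "An example sentence standing in for the short document text."
docs = [Doc(content=text), Doc(content=text, meta={"name": "report.txt"})]

doc_inputs = document_side_inputs(docs)
query_inputs = query_side_inputs([text])

print(doc_inputs[0])    # pair with an empty title
print(doc_inputs[1])    # pair with the name as title
print(query_inputs[0])  # bare string
```

Because the document side always receives a two-element pair and the query side receives a bare string, the encoder sees different inputs for the same text, which is consistent with a score slightly below 1.0.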

1 reply

julian-risch Oct 14, 2022
Maintainer

@mwade-noetic For your information, we will most likely change that part of the code in one of our next sprints. As a result, the generated embeddings should then be the same when you run your two example code snippets.
Tagging @mayankjobanputra because it might be a good issue for you to work on. 🙂

mayankjobanputra · 2022-10-18T21:05:27Z

mayankjobanputra
Oct 18, 2022

Yes, @julian-risch. I have tested a similar use case for my PR #3368.

@mwade-noetic I will post an update here once my PR is merged, and maybe you can confirm whether it solves your issue. We would be happy to look into it if it doesn't :)

2 replies

mwade-noetic Oct 19, 2022
Author

That is a nice change. I would be more than happy to take a look at it when you have it completed. I will be on the lookout for the update.

mayankjobanputra Nov 4, 2022

Hi @mwade-noetic, I think if you use the latest release (1.11 or higher), you should be able to run your use-cases again :) Please let us know here if you still face any problems!
