MostSimilarDocuments Pipeline embeddings do not match DocumentSearchPipeline embeddings with same text #3298
-
I am using the MostSimilarDocumentsPipeline as well as the DocumentSearchPipeline. With the MostSimilarDocumentsPipeline (MSD), we pass in the ID of the document that we want to find similar documents for. However, if I take that same document's text and pass it as the query to the DocumentSearchPipeline, I get somewhat different results. Looking closer, I realized that the embedding created for the query text does not match the embedding that was calculated for the same text in the MSD.

I am using the sentence-transformers/all-mpnet-base-v2 language model, and I have uploaded the documents into an ElasticsearchDocumentStore.

Here is the code I use to retrieve the most similar documents to one document in the index. The results are excellent, and I get a score of 1.0 when the text is 100% the same (the document itself and a few duplicates):
For the document search pipeline, I run the following code immediately after the code above:
The results are similar but not the same: the document that the query text comes from gets a score of 0.96 instead of 1.0. Any insight into why these results differ would be appreciated. Thanks
Replies: 2 comments 3 replies
-
Hi @mwade-noetic, one explanation for the difference you see in the generated embeddings is that a document's metadata, in particular its title (name), is taken into account when the document embedding is generated (and therefore also in your MostSimilarDocumentsPipeline); see haystack/haystack/nodes/retriever/_embedding_encoder.py, line 188 in 2298155. This step is not applied to your query text when you run `doc_search_pipeline.run(query=mpnet_result[0][0].content, debug=True)` in your DocumentSearchPipeline. Even if there is no document name, what is used for creating the document's embedding is something like …
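To see the mechanism in isolation, here is a minimal, self-contained sketch. It does not use Haystack or sentence-transformers: `toy_embed` is a hypothetical stand-in for the language model, and `document_input` with its `[SEP]` separator is an assumption made purely for illustration, not the actual preprocessing in `_embedding_encoder.py`. It only shows that when the document side receives anything beyond the raw text, even an empty title plus a separator, the two embeddings stop being identical and the similarity drops below 1.0.

```python
import hashlib
import math
from typing import List

DIM = 64  # size of the toy embedding space (arbitrary choice)


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of tokens, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # md5 keeps the bucket assignment deterministic across runs
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; inputs are already unit length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


def document_input(name: str, content: str) -> str:
    """Assumed document-side preprocessing: title and content joined by a separator."""
    return f"{name} [SEP] {content}"


content = "haystack pipelines can retrieve similar documents"

query_vec = toy_embed(content)                               # query side: raw text only
doc_vec_plain = toy_embed(content)                           # document embedded the same way
doc_vec_with_meta = toy_embed(document_input("", content))   # empty title, separator still added

same = cosine(query_vec, doc_vec_plain)
shifted = cosine(query_vec, doc_vec_with_meta)

print(f"identical input:       {same:.4f}")
print(f"with (empty) metadata: {shifted:.4f}")
```

With a real model the size of the drop will differ, but the direction is the same: the query and the document only score 1.0 against each other if both sides are encoded from exactly the same input.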
-
Yes @julian-risch. I have tested a similar use case for my PR #3368. @mwade-noetic, I will update here once my PR is merged, and maybe you can confirm whether it solves your issue. We would be happy to look into it if it doesn't fix it :)