Content deduplication - Reference #1496 #1641
sekh77
started this conversation in
Show and tell
-
Hi @sekh77! The for query in most_similar_docs:
highest_scored_doc = query[0] |
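A minimal sketch of how that loop could be extended into the CSV report asked for in the question. It assumes that each entry of most_similar_docs is itself a list of documents, that the highest-scored document in each list is the query document matching itself, and that the file name is stored under meta['name']. The helper names (field, duplicate_rows, write_report) and the 0.9 threshold are hypothetical and not part of any library. The sample data uses plain dicts, but the helper also reads attributes in case the entries are Document objects.

```python
import csv
import io


def field(doc, name):
    """Read a field from a plain dict or from an object with attributes."""
    return doc[name] if isinstance(doc, dict) else getattr(doc, name)


def duplicate_rows(most_similar_docs, threshold=0.9):
    """Build one (file name, score, duplicate file) row per likely duplicate.

    Assumption: the top-scored document of each group is the query document
    itself, so every other document above `threshold` is a duplicate of it.
    """
    rows = []
    for group in most_similar_docs:
        if not group:
            continue
        ranked = sorted(group, key=lambda d: field(d, "score"), reverse=True)
        source, candidates = ranked[0], ranked[1:]
        for doc in candidates:
            score = field(doc, "score")
            if score >= threshold:
                rows.append(
                    (
                        field(source, "meta")["name"],
                        f"{score:.1%}",  # 0.93728964 is rendered as 93.7%
                        field(doc, "meta")["name"],
                    )
                )
    return rows


def write_report(rows, out):
    """Write the header and the rows to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["File name", "Score", "Duplicate Files"])
    writer.writerows(rows)


# Sample data shaped like the result in the question (text elided there too).
sample = [
    [
        {"text": "<>", "score": 1.0, "meta": {"_split_id": 0, "name": "file_0673.txt"}},
        {"text": "<>", "score": 0.93728964, "meta": {"_split_id": 0, "name": "file_0781.txt"}},
    ]
]

rows = duplicate_rows(sample)
buffer = io.StringIO()
write_report(rows, buffer)
report_lines = buffer.getvalue().splitlines()
```

To write to disk instead of an in-memory buffer, pass a file opened with open("report.csv", "w", newline="") to write_report.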
-
This is in reference to my earlier discussion #1496
I managed to get the "MostSimilarDocumentsPipeline" running for my document store and could see duplicates being reported. Here's an example of the Document result object.
most_similar_docs = [[{'text': '<>', 'score': 1.0, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0673.txt'}, 'embedding': None, 'id': 'e1eccfd26a6354b493a601bf966d2b2a'}, {'text': '<>', 'score': 0.93728964, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0781.txt'}, 'embedding': None, 'id': 'ea119020fb1dad657dbbef87e7419894'}]]
I have 10 entries in the result object, most_similar_docs[0] to most_similar_docs[9]. Each entry was retrieved with top_k=4, so most_similar_docs[0] holds 4 documents.
How do I loop through most_similar_docs, and generate a CSV report as follows:
File name, Score, Duplicate Files
file_0673.txt, 93.7%, file_0781.txt
I tried in this way so far: print(list(map(lambda item: item.get('score', 'default value'), most_similar_docs)))
But I get the error: AttributeError: 'list' object has no attribute 'get'
Any help would be greatly appreciated.
Also, as @bogdankostic suggested in one of his replies in discussion #1496, I tried this statement: most_similar_docs[0].score. But I get an error:
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'list' object has no attribute 'score'
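Both errors are consistent with most_similar_docs being a list of lists: most_similar_docs[0] is the whole group of results for the first query, not a single document, so it has neither .get nor .score. A minimal sketch of that shape, assuming the documents behave like the dicts printed above (the score_of helper is hypothetical and also falls back to attribute access in case they are Document objects):

```python
# Reduced sample with the same nesting as the result shown above.
most_similar_docs = [
    [
        {"score": 1.0, "meta": {"name": "file_0673.txt"}},
        {"score": 0.93728964, "meta": {"name": "file_0781.txt"}},
    ]
]


def score_of(doc):
    """Return the score of a dict-like or attribute-style document."""
    return doc["score"] if isinstance(doc, dict) else doc.score


# The outer index selects a group (a list); a second index selects a document.
first_group = most_similar_docs[0]
first_doc = first_group[0]

# One list of scores per group, keeping the nesting intact.
scores = [[score_of(doc) for doc in group] for group in most_similar_docs]
```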
Thanks,
Sekhar H.