**Is your feature request related to a problem? Please describe.**
I am running evaluations against a manually curated ground-truth set of documents. The ground truth is based on retrieving the correct page of a document, identified by the joined id `(file_id, page_number)`.
Currently, all Document evaluators use `doc.content` for the comparison.
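
To illustrate the mismatch, here is a minimal sketch (the `meta` keys `file_id` and `page_number` are assumptions about how the page identity is stored): two documents that point at the same page do not match when only `doc.content` is compared.

```python
from haystack import Document

# A curated ground-truth entry and a retrieved chunk that point at the same page,
# identified by (file_id, page_number) stored in doc.meta (assumed key names).
ground_truth = Document(content="curated snippet", meta={"file_id": "report.pdf", "page_number": 3})
retrieved = Document(content="full page text extracted at retrieval time", meta={"file_id": "report.pdf", "page_number": 3})

# The current Document evaluators compare doc.content, so this counts as a miss ...
print(ground_truth.content == retrieved.content)  # False

# ... even though the page identity the ground truth was curated on matches.
def page_id(doc: Document):
    return (doc.meta["file_id"], doc.meta["page_number"])

print(page_id(ground_truth) == page_id(retrieved))  # True
```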
**Describe the solution you'd like**
I would like to define how the comparison is done via a `comparison_field` parameter:
````python
from typing import Any, Callable, Dict, Hashable, List
import warnings

from haystack import Document, component


@component
class DocumentMetaMRREvaluator:
    """
    Evaluator that calculates the mean reciprocal rank of the retrieved documents.

    MRR measures how high the first retrieved document is ranked.
    Each question can have multiple ground truth documents and multiple retrieved documents.

    `DocumentMetaMRREvaluator` doesn't normalize its inputs, the `DocumentCleaner` component
    should be used to clean and normalize the documents before passing them to this evaluator.

    Usage example:
    ```python
    from haystack import Document
    from haystack.components.evaluators import DocumentMetaMRREvaluator

    evaluator = DocumentMetaMRREvaluator()
    result = evaluator.run(
        ground_truth_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="9th")],
        ],
        retrieved_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
        ],
    )
    print(result["individual_scores"])
    # [1.0, 1.0]
    print(result["score"])
    # 1.0
    ```
    """

    def __init__(self, comparison_field: Callable[[Document], Hashable] = lambda doc: doc.content):
        # Defaults to comparing documents by their content, preserving the current behaviour.
        self.comparison_field = comparison_field

    # Refer to https://www.pinecone.io/learn/offline-evaluation/ for the algorithm.
    @component.output_types(score=float, individual_scores=List[float])
    def run(
        self, ground_truth_documents: List[List[Document]], retrieved_documents: List[List[Document]]
    ) -> Dict[str, Any]:
        """
        Run the DocumentMetaMRREvaluator on the given inputs.

        `ground_truth_documents` and `retrieved_documents` must have the same length.

        :param ground_truth_documents:
            A list of expected documents for each question.
        :param retrieved_documents:
            A list of retrieved documents for each question.
        :returns:
            A dictionary with the following outputs:
            - `score` - The average of calculated scores.
            - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high the first retrieved
              document is ranked.
        """
        if len(ground_truth_documents) != len(retrieved_documents):
            msg = "The length of ground_truth_documents and retrieved_documents must be the same."
            raise ValueError(msg)

        individual_scores = []
        has_comparison_values = False
        for ground_truth, retrieved in zip(ground_truth_documents, retrieved_documents):
            reciprocal_rank = 0.0
            # Documents are matched on the configured comparison field instead of doc.content.
            ground_truth_values = [
                self.comparison_field(doc) for doc in ground_truth if self.comparison_field(doc) is not None
            ]
            if ground_truth_values:
                has_comparison_values = True
            for rank, retrieved_document in enumerate(retrieved):
                retrieved_value = self.comparison_field(retrieved_document)
                if retrieved_value is None:
                    continue
                if retrieved_value in ground_truth_values:
                    reciprocal_rank = 1 / (rank + 1)
                    break
            individual_scores.append(reciprocal_rank)

        if has_comparison_values:
            score = sum(individual_scores) / len(ground_truth_documents)
        else:
            score = 0.0
            warnings.warn("No ground truth document has a value for the comparison field. Returning an MRR score of 0.0.")

        return {"score": score, "individual_scores": individual_scores}
````
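
For the evaluation described above, the comparison field would then map each document to its page identity. A usage sketch of the proposal (the `meta` keys `file_id` and `page_number` are again assumptions about how the pages are annotated):

```python
from haystack import Document

# Page-level ground truth: a hit means the retriever returned the right (file_id, page_number).
ground_truth_documents = [
    [Document(content="curated ground truth page", meta={"file_id": "report.pdf", "page_number": 3})],
]
retrieved_documents = [
    [
        Document(content="text of page 1", meta={"file_id": "report.pdf", "page_number": 1}),
        Document(content="text of page 3", meta={"file_id": "report.pdf", "page_number": 3}),
    ],
]

evaluator = DocumentMetaMRREvaluator(
    comparison_field=lambda doc: (doc.meta.get("file_id"), doc.meta.get("page_number"))
)
result = evaluator.run(
    ground_truth_documents=ground_truth_documents,
    retrieved_documents=retrieved_documents,
)
print(result["individual_scores"])  # [0.5]: the correct page is ranked second
print(result["score"])  # 0.5
```

With the default `comparison_field`, the component compares `doc.content` exactly like the existing `DocumentMRREvaluator`, so the change would be backwards compatible.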