Evaluation Metrics for Retrieval should have flexible comparison attributes to allow for consistent evaluation across different chunking strategies #9331

Open · deep-rloebbert opened this issue Apr 30, 2025 · 0 comments
Labels: P2 Medium priority, add to the next sprint if no P1 available


Is your feature request related to a problem? Please describe.
I am running evaluations against a ground truth set of documents that was curated with manual effort. The ground truth is based on retrieving the correct page from a document, identified by the joined id (file_id, page_number).

In all Document evaluators, `doc.content` is used for comparison:

ground_truth_contents = [doc.content for doc in ground_truth if doc.content is not None]
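As an illustration (a minimal sketch; the `file_id` and `page_number` meta keys are from my setup, not something Haystack prescribes): two chunks of the same ground-truth page produced by different chunking strategies carry the same page identity in `meta` but different `content`, so content-based comparison never matches them.

```python
from haystack import Document

# Two chunks of the same ground-truth page, produced by different chunking strategies.
ground_truth_chunk = Document(
    content="Paris is the capital of France.",
    meta={"file_id": "report.pdf", "page_number": 3},
)
retrieved_chunk = Document(
    content="... the capital of France is Paris ...",
    meta={"file_id": "report.pdf", "page_number": 3},
)

# Content-based comparison (what the evaluators do today) treats them as different documents.
print(ground_truth_chunk.content == retrieved_chunk.content)  # False

# Comparing on (file_id, page_number) treats them as the same page.
page_id = lambda doc: (doc.meta["file_id"], doc.meta["page_number"])
print(page_id(ground_truth_chunk) == page_id(retrieved_chunk))  # True
```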

Describe the solution you'd like
I would like to define how the comparison is done via a `comparison_field` parameter:

from typing import Any, Callable, Dict, Hashable, List

from haystack import Document, component


@component
class DocumentMetaMRREvaluator:
    """
    Evaluator that calculates the mean reciprocal rank of the retrieved documents.

    MRR measures how high the first retrieved document is ranked.
    Each question can have multiple ground truth documents and multiple retrieved documents.

    `DocumentMRREvaluator` doesn't normalize its inputs; the `DocumentCleaner` component
    should be used to clean and normalize the documents before passing them to this evaluator.

    Usage example:
    ```python
    from haystack import Document
    from haystack.components.evaluators import DocumentMRREvaluator

    evaluator = DocumentMRREvaluator()
    result = evaluator.run(
        ground_truth_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="9th")],
        ],
        retrieved_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
        ],
    )
    print(result["individual_scores"])
    # [1.0, 1.0]
    print(result["score"])
    # 1.0
    ```
    """

    def __init__(self, comparison_field: Callable[[Document], Hashable] = lambda doc: doc.content):
        """
        :param comparison_field: Callable that extracts the value used to compare documents.
            Defaults to `doc.content`.
        """
        self.comparison_field = comparison_field

    # Refer to https://www.pinecone.io/learn/offline-evaluation/ for the algorithm.
    @component.output_types(score=float, individual_scores=List[float])
    def run(
        self, ground_truth_documents: List[List[Document]], retrieved_documents: List[List[Document]]
    ) -> Dict[str, Any]:
        """
        Run the DocumentMRREvaluator on the given inputs.

        `ground_truth_documents` and `retrieved_documents` must have the same length.

        :param ground_truth_documents:
            A list of expected documents for each question.
        :param retrieved_documents:
            A list of retrieved documents for each question.
        :returns:
            A dictionary with the following outputs:
            - `score` - The average of calculated scores.
            - `individual_scores` - A list of numbers from 0.0 to 1.0 that represent how high the first retrieved
                document is ranked.
        """
        if len(ground_truth_documents) != len(retrieved_documents):
            msg = "The length of ground_truth_documents and retrieved_documents must be the same."
            raise ValueError(msg)

        individual_scores = []
        found_ground_truth_values = False

        for ground_truth, retrieved in zip(ground_truth_documents, retrieved_documents):
            reciprocal_rank = 0.0

            # Extract the comparison value once per document and keep falsy but valid values (e.g. 0 or "").
            ground_truth_values = [
                value for value in (self.comparison_field(doc) for doc in ground_truth) if value is not None
            ]
            if ground_truth_values:
                found_ground_truth_values = True

            for rank, retrieved_document in enumerate(retrieved):
                retrieved_value = self.comparison_field(retrieved_document)
                if retrieved_value is None:
                    continue
                if retrieved_value in ground_truth_values:
                    reciprocal_rank = 1 / (rank + 1)
                    break
            individual_scores.append(reciprocal_rank)

        if found_ground_truth_values:
            score = sum(individual_scores) / len(ground_truth_documents)
        else:
            score = 0.0
            print(
                "Warning: no ground truth document exposes the comparison field. Returning an MRR score of 0.0."
            )

        return {"score": score, "individual_scores": individual_scores}