
Evaluation Metrics for Retrieval should have flexible comparison attributes to allow for consistent evaluation across different chunking strategies #9331

Open
@deep-rloebbert

Description


Is your feature request related to a problem? Please describe.
I am running evaluations against a ground truth set of documents that was manually curated. The ground truth is defined as retrieving the correct page from a document, identified by the joined id (file_id, page_number).

In all Document evaluators, doc.content is used for comparison.

ground_truth_contents = [doc.content for doc in ground_truth if doc.content is not None]
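As a result, when the retrieval pipeline chunks pages differently than the ground truth set, the evaluator reports a miss even though the correct page was retrieved. A minimal sketch of the mismatch, assuming the documents carry illustrative `file_id` and `page_number` meta keys:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator

# Ground truth: one document per relevant page, identified by (file_id, page_number).
ground_truth = [
    Document(content="Full text of page 3 ...", meta={"file_id": "report.pdf", "page_number": 3})
]

# The retrieval pipeline chunks the same page into passages, so no retrieved `content`
# equals the ground truth `content`, even though the correct page was found.
retrieved = [
    Document(content="Full text of", meta={"file_id": "report.pdf", "page_number": 3}),
    Document(content="page 3 ...", meta={"file_id": "report.pdf", "page_number": 3}),
]

result = DocumentMRREvaluator().run(
    ground_truth_documents=[ground_truth], retrieved_documents=[retrieved]
)
print(result["score"])  # 0.0, although the right page is ranked first
```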

Describe the solution you'd like
I would like to define how the comparison is done via a configurable `comparison_field`:

from typing import Any, Callable, Dict, Hashable, List

from haystack import Document, component


@component
class DocumentMetaMRREvaluator:
    """
    Evaluator that calculates the mean reciprocal rank of the retrieved documents.

    MRR measures how high the first retrieved document is ranked.
    Each question can have multiple ground truth documents and multiple retrieved documents.

    `DocumentMetaMRREvaluator` doesn't normalize its inputs; the `DocumentCleaner` component
    should be used to clean and normalize the documents before passing them to this evaluator.

    Usage example:
    ```python
    from haystack import Document
    from haystack.components.evaluators import DocumentMetaMRREvaluator

    evaluator = DocumentMetaMRREvaluator()
    result = evaluator.run(
        ground_truth_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="9th")],
        ],
        retrieved_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
        ],
    )
    print(result["individual_scores"])
    # [1.0, 1.0]
    print(result["score"])
    # 1.0
    ```
    """

    def __init__(self, comparison_field: Callable[[Document], Hashable] = lambda doc: doc.content):
        """
        :param comparison_field:
            Callable extracting the value used for comparing documents (defaults to `doc.content`).
        """
        self.comparison_field = comparison_field

    # Refer to https://www.pinecone.io/learn/offline-evaluation/ for the algorithm.
    @component.output_types(score=float, individual_scores=List[float])
    def run(
        self, ground_truth_documents: List[List[Document]], retrieved_documents: List[List[Document]]
    ) -> Dict[str, Any]:
        """
        Run the DocumentMetaMRREvaluator on the given inputs.

        `ground_truth_documents` and `retrieved_documents` must have the same length.

        :param ground_truth_documents:
            A list of expected documents for each question.
        :param retrieved_documents:
            A list of retrieved documents for each question.
        :returns:
            A dictionary with the following outputs:
            - `score` - The average of calculated scores.
            - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high the first retrieved
                document is ranked.
        """
        if len(ground_truth_documents) != len(retrieved_documents):
            msg = "The length of ground_truth_documents and retrieved_documents must be the same."
            raise ValueError(msg)

        individual_scores = []
        has_comparison_values = False

        for ground_truth, retrieved in zip(ground_truth_documents, retrieved_documents):
            reciprocal_rank = 0.0

            # Extract the comparison value once per ground truth document; drop documents without one.
            ground_truth_values = [
                value for value in (self.comparison_field(doc) for doc in ground_truth) if value is not None
            ]
            if ground_truth_values:
                has_comparison_values = True

            for rank, retrieved_document in enumerate(retrieved):
                retrieved_value = self.comparison_field(retrieved_document)
                if retrieved_value is None:
                    continue
                if retrieved_value in ground_truth_values:
                    reciprocal_rank = 1 / (rank + 1)
                    break
            individual_scores.append(reciprocal_rank)

        if has_comparison_values:
            score = sum(individual_scores) / len(ground_truth_documents)
        else:
            score = 0.0
            print("Warning: no ground truth document has a value for the configured comparison field. Returning an MRR score of 0.0.")

        return {"score": score, "individual_scores": individual_scores}
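With such a hook, the (file_id, page_number) use case from the problem description could be expressed roughly as follows (a sketch against the class above; the meta key names are assumptions for illustration):

```python
from haystack import Document

# Compare documents by the page they originate from instead of by their text.
evaluator = DocumentMetaMRREvaluator(
    comparison_field=lambda doc: (doc.meta.get("file_id"), doc.meta.get("page_number"))
)

result = evaluator.run(
    ground_truth_documents=[
        [Document(content="page text", meta={"file_id": "report.pdf", "page_number": 3})]
    ],
    retrieved_documents=[
        [
            Document(content="chunk from another file", meta={"file_id": "other.pdf", "page_number": 1}),
            Document(content="a chunk of page 3", meta={"file_id": "report.pdf", "page_number": 3}),
        ]
    ],
)
print(result["individual_scores"])  # [0.5], the matching page is ranked second
print(result["score"])  # 0.5
```

The same hook would apply to the other Document evaluators as well, since they all share the content-based comparison.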

Metadata

Labels: P2 (Medium priority, add to the next sprint if no P1 available)