Evaluation Metrics for Retrieval should have flexible comparison attributes to allow for consistent evaluation across different chunking strategies #9331

Open · deep-rloebbert opened this issue Apr 30, 2025 · 0 comments
Labels: P2 Medium priority, add to the next sprint if no P1 available


Is your feature request related to a problem? Please describe.
I am running evaluations against a ground truth set of documents that was curated with manual effort. The ground truth is based on retrieving the correct page from a document, identified by the joined id (file_id, page_number).

In all Document evaluators, `doc.content` is used for comparison:

ground_truth_contents = [doc.content for doc in ground_truth if doc.content is not None]
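As an illustration (a minimal sketch; the `file_id` and `page_number` meta keys are from my setup, not something Haystack prescribes): two chunks of the same ground-truth page produced by different chunking strategies carry the same page identity in `meta` but different `content`, so content-based comparison never matches them.

```python
from haystack import Document

# Two chunks of the same ground-truth page, produced by different chunking strategies.
ground_truth_chunk = Document(
    content="Paris is the capital of France.",
    meta={"file_id": "report.pdf", "page_number": 3},
)
retrieved_chunk = Document(
    content="... the capital of France is Paris ...",
    meta={"file_id": "report.pdf", "page_number": 3},
)

# Content-based comparison (what the evaluators do today) treats them as different documents.
print(ground_truth_chunk.content == retrieved_chunk.content)  # False

# Comparing on (file_id, page_number) treats them as the same page.
page_id = lambda doc: (doc.meta["file_id"], doc.meta["page_number"])
print(page_id(ground_truth_chunk) == page_id(retrieved_chunk))  # True
```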

Describe the solution you'd like
I would like to define how the comparison is done via a `comparison_field` parameter:

from typing import Any, Callable, Dict, Hashable, List

from haystack import Document, component


@component
class DocumentMetaMRREvaluator:
    """
    Evaluator that calculates the mean reciprocal rank of the retrieved documents.

    MRR measures how high the first retrieved document is ranked.
    Each question can have multiple ground truth documents and multiple retrieved documents.

    `DocumentMRREvaluator` doesn't normalize its inputs; the `DocumentCleaner` component
    should be used to clean and normalize the documents before passing them to this evaluator.

    Usage example:
    ```python
    from haystack import Document
    from haystack.components.evaluators import DocumentMRREvaluator

    evaluator = DocumentMRREvaluator()
    result = evaluator.run(
        ground_truth_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="9th")],
        ],
        retrieved_documents=[
            [Document(content="France")],
            [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
        ],
    )
    print(result["individual_scores"])
    # [1.0, 1.0]
    print(result["score"])
    # 1.0
    ```
    """

    def __init__(self, comparison_field: Callable[[Document], Hashable] = lambda doc: doc.content):
        """
        :param comparison_field: Callable that extracts the value used to compare documents.
            Defaults to `doc.content`.
        """
        self.comparison_field = comparison_field

    # Refer to https://www.pinecone.io/learn/offline-evaluation/ for the algorithm.
    @component.output_types(score=float, individual_scores=List[float])
    def run(
        self, ground_truth_documents: List[List[Document]], retrieved_documents: List[List[Document]]
    ) -> Dict[str, Any]:
        """
        Run the DocumentMRREvaluator on the given inputs.

        `ground_truth_documents` and `retrieved_documents` must have the same length.

        :param ground_truth_documents:
            A list of expected documents for each question.
        :param retrieved_documents:
            A list of retrieved documents for each question.
        :returns:
            A dictionary with the following outputs:
            - `score` - The average of calculated scores.
            - `individual_scores` - A list of numbers from 0.0 to 1.0 that represent how high the first retrieved
                document is ranked.
        """
        if len(ground_truth_documents) != len(retrieved_documents):
            msg = "The length of ground_truth_documents and retrieved_documents must be the same."
            raise ValueError(msg)

        individual_scores = []
        found_ground_truth_values = False

        for ground_truth, retrieved in zip(ground_truth_documents, retrieved_documents):
            reciprocal_rank = 0.0

            # Extract the comparison value once per document and keep falsy but valid values (e.g. 0 or "").
            ground_truth_values = [
                value for value in (self.comparison_field(doc) for doc in ground_truth) if value is not None
            ]
            if ground_truth_values:
                found_ground_truth_values = True

            for rank, retrieved_document in enumerate(retrieved):
                retrieved_value = self.comparison_field(retrieved_document)
                if retrieved_value is None:
                    continue
                if retrieved_value in ground_truth_values:
                    reciprocal_rank = 1 / (rank + 1)
                    break
            individual_scores.append(reciprocal_rank)

        if found_ground_truth_values:
            score = sum(individual_scores) / len(ground_truth_documents)
        else:
            score = 0.0
            print(
                "Warning: no ground truth document exposes the comparison field. Returning an MRR score of 0.0."
            )

        return {"score": score, "individual_scores": individual_scores}