Update evaluation packages and evaluators #46072

Merged 4 commits on May 12, 2025

Changes from 3 commits

4 changes: 4 additions & 0 deletions .openpublishing.redirection.ai.json
@@ -55,6 +55,10 @@
{
"source_path_from_root": "/docs/ai/quickstarts/quickstart-openai-summarize-text.md",
"redirect_url": "/dotnet/ai/quickstarts/prompt-model"
},
{
"source_path_from_root": "/docs/ai/tutorials/llm-eval.md",
"redirect_url": "/dotnet/ai/quickstarts/evaluate-ai-response"
}
]
}
31 changes: 22 additions & 9 deletions docs/ai/conceptual/evaluation-libraries.md
@@ -2,7 +2,7 @@
title: The Microsoft.Extensions.AI.Evaluation libraries
description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.
ms.topic: concept-article
ms.date: 03/18/2025
ms.date: 05/09/2025
---
# The Microsoft.Extensions.AI.Evaluation libraries (Preview)

@@ -11,7 +11,8 @@ The Microsoft.Extensions.AI.Evaluation libraries (currently in preview) simplify
The evaluation libraries, which are built on top of the [Microsoft.Extensions.AI abstractions](../microsoft-extensions-ai.md), are composed of the following NuGet packages:

- [📦 Microsoft.Extensions.AI.Evaluation](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation) – Defines the core abstractions and types for supporting evaluation.
- [📦 Microsoft.Extensions.AI.Evaluation.Quality](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) – Contains evaluators that assess the quality of LLM responses in an app according to metrics such as relevance, fluency, coherence, and truthfulness.
- [📦 Microsoft.Extensions.AI.Evaluation.Quality](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Quality) – Contains evaluators that assess the quality of LLM responses in an app according to metrics such as relevance and completeness. These evaluators use the LLM directly to perform evaluations.
- [📦 Microsoft.Extensions.AI.Evaluation.Safety](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Safety) – Contains evaluators, such as the `ProtectedMaterialEvaluator` and `ContentHarmEvaluator`, that use the [Azure AI Foundry](/azure/ai-foundry/) Evaluation service to perform evaluations.
- [📦 Microsoft.Extensions.AI.Evaluation.Reporting](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting) – Contains support for caching LLM responses, storing the results of evaluations, and generating reports from that data.
- [📦 Microsoft.Extensions.AI.Evaluation.Reporting.Azure](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Reporting.Azure) - Supports the reporting library with an implementation for caching LLM responses and storing the evaluation results in an [Azure Storage](/azure/storage/common/storage-introduction) container.
- [📦 Microsoft.Extensions.AI.Evaluation.Console](https://www.nuget.org/packages/Microsoft.Extensions.AI.Evaluation.Console) – A command-line tool for generating reports and managing evaluation data.
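
To see how these packages fit together, here's a minimal sketch of evaluating a single response with one of the LLM-based quality evaluators. It's illustrative only and assumes the preview API shapes at the time of this PR; `CreateChatClient` is a placeholder for however your app obtains an `IChatClient`, and the exact `EvaluateAsync` overloads may differ.

```csharp
using System;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Placeholder: supply any IChatClient implementation (Azure OpenAI, OpenAI, Ollama, ...).
IChatClient chatClient = CreateChatClient();

// ChatConfiguration tells the quality evaluators which LLM to use for grading.
var chatConfiguration = new ChatConfiguration(chatClient);

string prompt = "Explain how solar panels generate electricity.";
ChatResponse response = await chatClient.GetResponseAsync(prompt);

// CoherenceEvaluator (from the Quality package) uses the LLM to grade the response.
IEvaluator coherenceEvaluator = new CoherenceEvaluator();
EvaluationResult result = await coherenceEvaluator.EvaluateAsync(
    prompt, response, chatConfiguration);

// Quality metrics are numeric scores, typically on a 1-5 scale.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");

static IChatClient CreateChatClient() =>
    throw new NotImplementedException("Wire up your model provider here.");
```

The Reporting packages build on the same pattern: the evaluators are registered in a reporting configuration so that LLM responses can be cached and evaluation results collected into reports.
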
@@ -24,13 +25,25 @@ The libraries are designed to integrate smoothly with existing .NET apps, allowi

The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following table shows the built-in evaluators.

| Metric | Description | Evaluator type |
|------------------------------------|----------------------------------------------|----------------|
| Relevance, truth, and completeness | How effectively a response addresses a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator> |

Review comment:
RTCEvaluator is still being shipped so I think we should continue to include it in the table - perhaps we can move it to the end of the table? Also, since it is now marked as [Experimental], it would be great to call this out somehow.

Note that the metrics returned from RTCEvaluator are named 'Relevance (RTC)', 'Truth (RTC)' and 'Completeness (RTC)' so that they won't conflict with the newly introduced dedicated 'Relevance' and 'Completeness' metrics.

| Fluency | Grammatical accuracy, vocabulary range, sentence complexity, and overall readability| <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
| Coherence | The logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
| Equivalence | The similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
| Groundedness | How well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> |
| Metric | Description | Evaluator type |
|--------------|--------------------------------------------------------|----------------|
| Relevance | Evaluates how relevant a response is to a query | `RelevanceEvaluator` <!-- <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator> --> |
| Completeness | Evaluates how comprehensive and accurate a response is | `CompletenessEvaluator` <!-- <xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator> --> |
| Retrieval | Evaluates performance in retrieving information for additional context | `RetrievalEvaluator` <!-- <xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator> --> |
| Fluency | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability| <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
| Coherence | Evaluates the logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
| Equivalence | Evaluates the similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
| Groundedness | Evaluates how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator><br />`GroundednessProEvaluator` |

Review comment:
Could you please separate this into 2 tables? The first one for Quality evaluators and the second for Safety evaluators.

It would be great to include a sentence about each set at the top of each table to clarify that -

  • Quality evaluators measure response quality for the following metrics and they use an LLM to perform the evaluation.
  • Safety evaluators check for presence of harmful / inappropriate / unsafe content in responses and they rely on the Azure AI Foundry Evaluation service (which uses a fine-tuned model behind the scenes).

Contributor Author:
@shyamnamboodiripad Yes, will do in a follow up PR. Thanks for reviewing!

| Protected material | Evaluates response for the presence of protected material | `ProtectedMaterialEvaluator` |
| Ungroundedness | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | `UngroundedAttributesEvaluator` |
| Hate content | Evaluates a response for the presence of content that's hateful or unfair | `HateAndUnfairnessEvaluator`† |
| Self-harm content | Evaluates a response for the presence of content that indicates self-harm | `SelfHarmEvaluator`† |
| Violent content | Evaluates a response for the presence of violent content | `ViolenceEvaluator`† |
| Sexual content | Evaluates a response for the presence of sexual content | `SexualEvaluator`† |
| Code vulnerability content | Evaluates a response for the presence of vulnerable code | `CodeVulnerabilityEvaluator` |
| Indirect attack content | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | `IndirectAttackEvaluator` |

† In addition, the `ContentHarmEvaluator` provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
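
The quality evaluators all follow the same calling pattern, so they compose easily. Continuing the earlier sketch (reusing `prompt`, `response`, and `chatConfiguration`), something like the following runs several of them over one response. This again assumes the preview API shapes; note that some evaluators (for example, equivalence, groundedness, retrieval, and completeness) also expect ground-truth or context data through the `additionalContext` parameter, and the safety evaluators require an Azure AI Foundry project configured for the Evaluation service, which isn't shown here.

```csharp
using System;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Evaluators that only need the conversation itself (no extra context).
IEvaluator[] evaluators =
[
    new CoherenceEvaluator(),
    new FluencyEvaluator(),
    new RelevanceEvaluator(), // one of the evaluators introduced in this update
];

foreach (IEvaluator evaluator in evaluators)
{
    EvaluationResult result = await evaluator.EvaluateAsync(prompt, response, chatConfiguration);

    // Each evaluator contributes one or more named metrics to the result.
    foreach (EvaluationMetric metric in result.Metrics.Values)
    {
        Console.WriteLine($"{metric.Name}: {(metric as NumericMetric)?.Value}");
    }
}
```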

You can also add your own custom evaluators by implementing the <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface or extending base classes such as <xref:Microsoft.Extensions.AI.Evaluation.Quality.ChatConversationEvaluator> and <xref:Microsoft.Extensions.AI.Evaluation.Quality.SingleNumericMetricEvaluator>.
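
As a rough illustration of that extension point, the sketch below implements a toy evaluator that scores a response by its word count. The `IEvaluator` member names and parameter types shown are assumptions based on the preview API at the time of this PR; check the package's reference documentation for the exact shapes.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// A toy evaluator that reports the response's word count as a numeric metric.
public sealed class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Word Count";

    public IReadOnlyCollection<string> EvaluationMetricNames => [WordCountMetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        int wordCount = modelResponse.Text
            .Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries)
            .Length;

        // NumericMetric is one of the built-in metric types; BooleanMetric and
        // StringMetric are also available for non-numeric results.
        var metric = new NumericMetric(WordCountMetricName, wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```

A custom evaluator like this plugs into the same `EvaluateAsync` call sites and reporting pipeline as the built-in evaluators.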

2 changes: 0 additions & 2 deletions docs/ai/toc.yml
@@ -85,8 +85,6 @@ items:
href: quickstarts/evaluate-ai-response.md
- name: "Tutorial: Evaluate a response with response caching and reporting"
href: tutorials/evaluate-with-reporting.md
- name: "Tutorial: Evaluate LLM prompt completions"
href: tutorials/llm-eval.md
- name: Resources
items:
- name: API reference
159 changes: 0 additions & 159 deletions docs/ai/tutorials/llm-eval.md

This file was deleted.