diff --git a/docs/ai/conceptual/evaluation-libraries.md b/docs/ai/conceptual/evaluation-libraries.md
index bf81c80ed2b8a..bb5be136cb82e 100644
--- a/docs/ai/conceptual/evaluation-libraries.md
+++ b/docs/ai/conceptual/evaluation-libraries.md
@@ -2,7 +2,7 @@
 title: The Microsoft.Extensions.AI.Evaluation libraries
 description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.
 ms.topic: concept-article
-ms.date: 05/09/2025
+ms.date: 05/13/2025
 ---
 
 # The Microsoft.Extensions.AI.Evaluation libraries (Preview)
 
@@ -23,29 +23,44 @@ The libraries are designed to integrate smoothly with existing .NET apps, allowi
 ## Comprehensive evaluation metrics
 
-The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following table shows the built-in evaluators.
-
-| Metric | Description | Evaluator type |
-|--------------|--------------------------------------------------------|----------------|
-| Relevance | Evaluates how relevant a response is to a query | `RelevanceEvaluator` |
-| Completeness | Evaluates how comprehensive and accurate a response is | `CompletenessEvaluator` |
-| Retrieval | Evaluates performance in retrieving information for additional context | `RetrievalEvaluator` |
-| Fluency | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability | `FluencyEvaluator` |
-| Coherence | Evaluates the logical and orderly presentation of ideas | `CoherenceEvaluator` |
-| Equivalence | Evaluates the similarity between the generated text and its ground truth with respect to a query | `EquivalenceEvaluator` |
-| Groundedness | Evaluates how well a generated response aligns with the given context | `GroundednessProEvaluator` |
-| Protected material | Evaluates response for the presence of protected material | `ProtectedMaterialEvaluator` |
-| Ungrounded human attributes | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | `UngroundedAttributesEvaluator` |
-| Hate content | Evaluates a response for the presence of content that's hateful or unfair | `HateAndUnfairnessEvaluator`† |
-| Self-harm content | Evaluates a response for the presence of content that indicates self harm | `SelfHarmEvaluator`† |
-| Violent content | Evaluates a response for the presence of violent content | `ViolenceEvaluator`† |
-| Sexual content | Evaluates a response for the presence of sexual content | `SexualEvaluator`† |
-| Code vulnerability content | Evaluates a response for the presence of vulnerable code | `CodeVulnerabilityEvaluator` |
-| Indirect attack content | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | `IndirectAttackEvaluator` |
-
-† In addition, the `ContentHarmEvaluator` provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
-
-You can also customize to add your own evaluations by implementing the `IEvaluator` interface or extending the provided base classes.
+The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators) and [safety](#safety-evaluators) evaluators and the metrics they measure.
+
+You can also add your own custom evaluations by implementing the `IEvaluator` interface.
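+
+For example, the following sketch shows the general shape of a custom evaluator that reports a single numeric metric. The `ResponseLengthEvaluator` type is hypothetical, and the exact `IEvaluator` member signatures are an assumption based on the preview API, so they might differ from the shipped package:
+
+```csharp
+using Microsoft.Extensions.AI;
+using Microsoft.Extensions.AI.Evaluation;
+
+// A hypothetical custom evaluator that reports a single numeric
+// metric: the character length of the response text.
+public class ResponseLengthEvaluator : IEvaluator
+{
+    private const string MetricName = "Response Length";
+
+    public IReadOnlyCollection<string> EvaluationMetricNames => [MetricName];
+
+    public ValueTask<EvaluationResult> EvaluateAsync(
+        IEnumerable<ChatMessage> messages,
+        ChatResponse modelResponse,
+        ChatConfiguration? chatConfiguration = null,
+        IEnumerable<EvaluationContext>? additionalContext = null,
+        CancellationToken cancellationToken = default)
+    {
+        // Score the response by the length of its text. A real evaluator
+        // would apply whatever custom logic or LLM calls it needs.
+        var metric = new NumericMetric(MetricName, value: modelResponse.Text.Length);
+        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
+    }
+}
+```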
+
+### Quality evaluators
+
+Quality evaluators measure response quality. They use an LLM to perform the evaluation; the sketch after the table shows how to run one.
+
+| Metric | Description | Evaluator type |
+|----------------|--------------------------------------------------------|----------------|
+| `Relevance` | Evaluates how relevant a response is to a query | `RelevanceEvaluator` |
+| `Completeness` | Evaluates how comprehensive and accurate a response is | `CompletenessEvaluator` |
+| `Retrieval` | Evaluates performance in retrieving information for additional context | `RetrievalEvaluator` |
+| `Fluency` | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability | `FluencyEvaluator` |
+| `Coherence` | Evaluates the logical and orderly presentation of ideas | `CoherenceEvaluator` |
+| `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query | `EquivalenceEvaluator` |
+| `Groundedness` | Evaluates how well a generated response aligns with the given context | `GroundednessEvaluator` |
+| `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is | `RelevanceTruthAndCompletenessEvaluator`† |
+
+† This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
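+
+For example, the following sketch runs one of the quality evaluators against a response. It assumes you already have a configured `IChatClient` (the LLM that serves as the evaluation judge), and the `EvaluateCoherenceAsync` wrapper is just for illustration; the exact type and member names are based on the preview API and might change:
+
+```csharp
+using Microsoft.Extensions.AI;
+using Microsoft.Extensions.AI.Evaluation;
+using Microsoft.Extensions.AI.Evaluation.Quality;
+
+static async Task EvaluateCoherenceAsync(IChatClient chatClient)
+{
+    // The ChatConfiguration tells the evaluator which LLM to use as the judge.
+    ChatConfiguration chatConfiguration = new(chatClient);
+
+    List<ChatMessage> messages = [new(ChatRole.User, "What's the tallest mountain on Earth?")];
+    ChatResponse response = await chatClient.GetResponseAsync(messages);
+
+    // The evaluator prompts the judge LLM and returns a numeric score.
+    IEvaluator evaluator = new CoherenceEvaluator();
+    EvaluationResult result = await evaluator.EvaluateAsync(messages, response, chatConfiguration);
+
+    NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
+    Console.WriteLine($"Coherence: {coherence.Value}");
+}
+```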
+
+### Safety evaluators
+
+Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.
+
+| Metric | Description | Evaluator type |
+|--------------------|-----------------------------------------------------------------------|------------------------------|
+| `Groundedness Pro` | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context | `GroundednessProEvaluator` |
+| `Protected Material` | Evaluates a response for the presence of protected material | `ProtectedMaterialEvaluator` |
+| `Ungrounded Attributes` | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | `UngroundedAttributesEvaluator` |
+| `Hate And Unfairness` | Evaluates a response for the presence of content that's hateful or unfair | `HateAndUnfairnessEvaluator`† |
+| `Self Harm` | Evaluates a response for the presence of content that indicates self-harm | `SelfHarmEvaluator`† |
+| `Violence` | Evaluates a response for the presence of violent content | `ViolenceEvaluator`† |
+| `Sexual` | Evaluates a response for the presence of sexual content | `SexualEvaluator`† |
+| `Code Vulnerability` | Evaluates a response for the presence of vulnerable code | `CodeVulnerabilityEvaluator` |
+| `Indirect Attack` | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | `IndirectAttackEvaluator` |
+
+† In addition, the `ContentHarmEvaluator` provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
 
 ## Cached responses