
Update evaluation packages and evaluators #46072


Merged: @gewarren merged 4 commits into dotnet:main from the eval-1 branch on May 12, 2025

Conversation

@gewarren (Contributor) commented May 9, 2025

Fixes #45078 (remove Semantic Kernel eval tutorial)
Contributes to #46071 (update packages and evaluators)


Internal previews

| 📄 File | 🔗 Preview link |
|---------|-----------------|
| docs/ai/conceptual/evaluation-libraries.md | The Microsoft.Extensions.AI.Evaluation libraries (Preview) |
| docs/ai/toc.yml | docs/ai/toc |

@gewarren requested review from alexwolfmsft and a team as code owners May 9, 2025 22:28
@dotnetrepoman (bot) added this to the May 2025 milestone May 9, 2025
@BillWagner (Member) left a comment

This LGTM @gewarren

Let's :shipit:

@gewarren merged commit 01409b0 into dotnet:main May 12, 2025
8 checks passed
@gewarren deleted the eval-1 branch May 12, 2025 18:31
| Fluency | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability | <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
| Coherence | Evaluates the logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
| Equivalence | Evaluates the similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
| Groundedness | Evaluates how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator><br />`GroundednessProEvaluator` |

@shyamnamboodiripad left a comment

Could you please separate this into 2 tables? The first one for Quality evaluators and the second for Safety evaluators.

It would be great to include a sentence about each set at the top of each table to clarify that:

  • Quality evaluators measure response quality against the corresponding metrics, and they use an LLM to perform the evaluation (see the usage sketch after this list).
  • Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in responses, and they rely on the Azure AI Foundry Evaluation service (which uses a fine-tuned model behind the scenes).
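
For illustration only (not part of this PR's diff), here is a minimal sketch of how one of the LLM-based quality evaluators might be invoked, assuming the Preview API shape of Microsoft.Extensions.AI.Evaluation; the `GetChatClient` helper is hypothetical, and exact member names may differ between preview versions:

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Placeholder (hypothetical helper): supply whatever IChatClient your app already uses.
// Quality evaluators use this LLM connection to score the response.
IChatClient chatClient = GetChatClient();
ChatConfiguration chatConfiguration = new(chatClient);

// The conversation that produced the response being evaluated.
List<ChatMessage> messages =
[
    new(ChatRole.User, "Explain what a coherence metric measures.")
];

// The response under evaluation (normally the output of your app's model call).
ChatResponse response = new(
    new ChatMessage(ChatRole.Assistant,
        "A coherence metric measures how logically and orderly ideas are presented."));

// CoherenceEvaluator is one of the LLM-based quality evaluators in the table above.
IEvaluator evaluator = new CoherenceEvaluator();
EvaluationResult result =
    await evaluator.EvaluateAsync(messages, response, chatConfiguration);

// Each evaluator reports one or more named metrics on the result.
NumericMetric coherence = result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");

static IChatClient GetChatClient() =>
    throw new NotImplementedException("Wire up your own IChatClient here.");
```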

@gewarren (Contributor, Author) replied

@shyamnamboodiripad Yes, will do in a follow up PR. Thanks for reviewing!

@@ -24,13 +25,25 @@ The libraries are designed to integrate smoothly with existing .NET apps, allowi

The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following table shows the built-in evaluators.

| Metric | Description | Evaluator type |
|------------------------------------|----------------------------------------------|----------------|
| Relevance, truth, and completeness | How effectively a response addresses a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator> |

@shyamnamboodiripad left a comment

RTCEvaluator is still being shipped so I think we should continue to include it in the table - perhaps we can move it to the end of the table? Also, since it is now marked as [Experimental], it would be great to call this out somehow.

Note that the metrics returned from RTCEvaluator are named 'Relevance (RTC)', 'Truth (RTC)' and 'Completeness (RTC)' so that they won't conflict with the newly introduced dedicated 'Relevance' and 'Completeness' metrics.
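
To make that naming point concrete, a small hedged sketch that reuses the `messages`, `response`, and `chatConfiguration` from the earlier sketch; the "(RTC)" metric names are quoted from the comment above, while the `Get<NumericMetric>(name)` retrieval pattern is an assumption about the Preview API:

```csharp
// RelevanceTruthAndCompletenessEvaluator is marked [Experimental], so your project
// may need to acknowledge the experimental diagnostic before this compiles cleanly.
IEvaluator rtcEvaluator = new RelevanceTruthAndCompletenessEvaluator();
EvaluationResult rtcResult =
    await rtcEvaluator.EvaluateAsync(messages, response, chatConfiguration);

// The "(RTC)" suffix keeps these metrics from colliding with the newly introduced
// dedicated Relevance and Completeness metrics.
NumericMetric relevance = rtcResult.Get<NumericMetric>("Relevance (RTC)");
NumericMetric truth = rtcResult.Get<NumericMetric>("Truth (RTC)");
NumericMetric completeness = rtcResult.Get<NumericMetric>("Completeness (RTC)");
Console.WriteLine($"Relevance (RTC): {relevance.Value}, Truth (RTC): {truth.Value}");
```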

Development

Successfully merging this pull request may close these issues: Update LLM evaluation tutorial.
4 participants