Update evaluation packages and evaluators #46072
Conversation
This LGTM @gewarren
| Metric | Description | Evaluator type |
|--------|-------------|----------------|
| Fluency | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability | <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
| Coherence | Evaluates the logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
| Equivalence | Evaluates the similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
| Groundedness | Evaluates how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator><br />`GroundednessProEvaluator` |
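For context, here is a minimal sketch of how one of these quality evaluators might be invoked. The `ChatConfiguration` usage and the `EvaluateAsync` parameter order are assumed from the package's core abstractions and may differ slightly from the shipped API; you'd also need to supply a real `IChatClient`.

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Supply an IChatClient for the LLM that produces the response and performs the grading.
// How you construct it (Azure OpenAI, OpenAI, etc.) is up to your app; this is a placeholder.
IChatClient chatClient = /* your IChatClient */ null!;

// Quality evaluators use an LLM to perform the evaluation; it's passed in via ChatConfiguration.
ChatConfiguration chatConfiguration = new(chatClient);

// The conversation to evaluate: the user's question and the model's answer.
var messages = new[] { new ChatMessage(ChatRole.User, "What is dependency injection in .NET?") };
ChatResponse response = await chatClient.GetResponseAsync(messages);

// Evaluate the response's coherence; the result contains one or more named metrics.
IEvaluator evaluator = new CoherenceEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(messages, response, chatConfiguration);
```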
Could you please separate this into 2 tables? The first one for Quality evaluators and the second for Safety evaluators.

It would be great to include a sentence about each set at the top of each table to clarify that:

- Quality evaluators measure response quality for the following metrics, and they use an LLM to perform the evaluation.
- Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in responses, and they rely on the Azure AI Foundry Evaluation service (which uses a fine-tuned model behind the scenes), as sketched below.
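To illustrate that split, here is a rough sketch of how the two kinds of evaluators are configured differently. It reuses the `chatClient` from the earlier sketch; the `ContentSafetyServiceConfiguration` type, its parameter names, and the `ToChatConfiguration()` call are assumptions about the Safety package made for illustration only.

```csharp
using Azure.Identity;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;
using Microsoft.Extensions.AI.Evaluation.Safety;

// Quality evaluators: the grading LLM is one you supply through ChatConfiguration.
ChatConfiguration qualityConfig = new(chatClient);
IEvaluator coherence = new CoherenceEvaluator();

// Safety evaluators: evaluation is performed by the Azure AI Foundry Evaluation service,
// so they're configured with Azure AI Foundry project details rather than your own LLM.
// The type and member names below are assumptions for illustration.
var safetyServiceConfig = new ContentSafetyServiceConfiguration(
    credential: new DefaultAzureCredential(),
    subscriptionId: "<subscription-id>",
    resourceGroupName: "<resource-group>",
    projectName: "<ai-foundry-project>");
ChatConfiguration safetyConfig = safetyServiceConfig.ToChatConfiguration();

// A safety evaluator such as GroundednessProEvaluator would then use safetyConfig.
IEvaluator groundednessPro = new GroundednessProEvaluator();
```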
@shyamnamboodiripad Yes, will do in a follow-up PR. Thanks for reviewing!
The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following table shows the built-in evaluators.

| Metric | Description | Evaluator type |
|------------------------------------|----------------------------------------------|----------------|
| Relevance, truth, and completeness | How effectively a response addresses a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator> |
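As a rough illustration of how several built-in evaluators might be combined into one evaluation pass (reusing `messages`, `response`, and `chatConfiguration` from the earlier sketch; treat `CompositeEvaluator` and the exact call shape as assumptions):

```csharp
// Aggregate multiple built-in evaluators so a single call yields all of their metrics.
IEvaluator combined = new CompositeEvaluator(
    new FluencyEvaluator(),
    new CoherenceEvaluator(),
    new RelevanceTruthAndCompletenessEvaluator()); // marked [Experimental]; see the discussion below

EvaluationResult allMetrics = await combined.EvaluateAsync(messages, response, chatConfiguration);
```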
RTCEvaluator is still being shipped, so I think we should continue to include it in the table - perhaps we can move it to the end of the table? Also, since it's now marked as [Experimental], it would be great to call this out somehow.
Note that the metrics returned from RTCEvaluator are named 'Relevance (RTC)', 'Truth (RTC)', and 'Completeness (RTC)' so that they won't conflict with the newly introduced dedicated 'Relevance' and 'Completeness' metrics.
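For example, reading those RTC-suffixed metrics from an evaluation result might look like the following (reusing `messages`, `response`, and `chatConfiguration` from the earlier sketch; the `Get<NumericMetric>` accessor is assumed from the core abstractions):

```csharp
EvaluationResult rtcResult = await new RelevanceTruthAndCompletenessEvaluator()
    .EvaluateAsync(messages, response, chatConfiguration);

// The "(RTC)" suffix keeps these names from colliding with the metrics produced by
// the dedicated Relevance and Completeness evaluators.
NumericMetric relevance = rtcResult.Get<NumericMetric>("Relevance (RTC)");
NumericMetric truth = rtcResult.Get<NumericMetric>("Truth (RTC)");
NumericMetric completeness = rtcResult.Get<NumericMetric>("Completeness (RTC)");

Console.WriteLine($"Relevance (RTC): {relevance.Value}");
```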
Fixes #45078 (remove Semantic Kernel eval tutorial)
Contributes to #46071 (update packages and evaluators)