sdk/evaluation/azure-ai-evaluation/CHANGELOG.md (1 addition, 1 deletion)
@@ -8,7 +8,7 @@

 - Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster, and consumes fewer tokens.

-- Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently, without context on the other tool calls in the same turn, and then aggregated the results to a number in the range [0-1]. The number range is now [1-5].
+- Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently, without context on the other tool calls in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].

 - Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
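To make the ToolCallAccuracy change concrete, here is a minimal sketch contrasting the two behaviors. The helper names are hypothetical and not SDK API; the sketch only assumes what the changelog states: the old version judged each tool call independently and averaged the results to [0-1], while the new version produces a single 1-5 judgment over the whole turn.

```python
# Illustrative sketch only; these helpers are hypothetical, not SDK API.

def old_aggregate(per_call_pass: list[bool]) -> float:
    """Old behavior: each tool call judged independently, results averaged to [0-1]."""
    return sum(per_call_pass) / len(per_call_pass)

def new_turn_score(turn_score: int) -> int:
    """New behavior: one judgment over all tool calls in the turn, on a 1-5 scale."""
    if not 1 <= turn_score <= 5:
        raise ValueError("Tool call accuracy score must be between 1 and 5.")
    return turn_score

print(old_aggregate([True, False, True]))  # 0.666..., old [0-1] range
print(new_turn_score(4))                   # 4, new [1-5] range
```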
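The MeteorScoreEvaluator fix is easy to reproduce in isolation. The snippet below is not the SDK's code, only a sketch of the truncation bug it describes (with an illustrative 0.5 threshold): `int()` truncates 0.9375 to 0, which then fails a threshold comparison it should pass.

```python
score, threshold = 0.9375, 0.5

buggy = int(score) >= threshold    # int(0.9375) == 0, so False (wrong)
fixed = float(score) >= threshold  # 0.9375 >= 0.5, so True (correct)

print(buggy, fixed)  # False True
```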
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/_tool_call_accuracy.py (18 additions, 9 deletions)
@@ -71,6 +71,15 @@ class ToolCallAccuracyEvaluator(PromptyEvaluatorBase[Union[str, float]]):
     _MIN_TOOL_CALL_ACCURACY_SCORE = 1
     _DEFAULT_TOOL_CALL_ACCURACY_SCORE = 3

+    _NO_TOOL_CALLS_MESSAGE = "No tool calls found in response or provided tool_calls."
+    _NO_TOOL_DEFINITIONS_MESSAGE = "Tool definitions must be provided."
+    _TOOL_DEFINITIONS_MISSING_MESSAGE = "Tool definitions for all tool calls must be provided."
+    _INVALID_SCORE_MESSAGE = "Tool call accuracy score must be between 1 and 5."
+
+    _LLM_SCORE_KEY = "tool_calls_success_level"
+    _EXCESS_TOOL_CALLS_KEY = "excess_tool_calls"
+    _MISSING_TOOL_CALLS_KEY = "missing_tool_calls"
+
     id = "id"
     """Evaluator identifier, experimental and to be used only with evaluation in cloud."""
@@ ... @@
             message=f"Invalid score value: {score}. Expected a number in range [{ToolCallAccuracyEvaluator._MIN_TOOL_CALL_ACCURACY_SCORE}, {ToolCallAccuracyEvaluator._MAX_TOOL_CALL_ACCURACY_SCORE}].",
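A hedged sketch of how the constants and the error message above could fit together. The `check_score` helper and the stand-in class are hypothetical, and `_MAX_TOOL_CALL_ACCURACY_SCORE = 5` is assumed from the [1-5] range rather than shown in this diff.

```python
# Hypothetical stand-in for the real evaluator class, carrying only the
# attributes visible in this diff (plus the assumed _MAX constant).
class ToolCallAccuracyEvaluator:
    _MIN_TOOL_CALL_ACCURACY_SCORE = 1
    _MAX_TOOL_CALL_ACCURACY_SCORE = 5  # assumed; implied by the [1-5] range
    _LLM_SCORE_KEY = "tool_calls_success_level"

def check_score(llm_output: dict) -> int:
    """Pull the LLM's turn-level score from its output and validate the range."""
    score = llm_output.get(ToolCallAccuracyEvaluator._LLM_SCORE_KEY)
    lo = ToolCallAccuracyEvaluator._MIN_TOOL_CALL_ACCURACY_SCORE
    hi = ToolCallAccuracyEvaluator._MAX_TOOL_CALL_ACCURACY_SCORE
    if not isinstance(score, (int, float)) or not lo <= score <= hi:
        raise ValueError(
            f"Invalid score value: {score}. Expected a number in range [{lo}, {hi}]."
        )
    return int(score)

print(check_score({"tool_calls_success_level": 4}))  # 4
```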