
Commit bf66ea7

Authored by salma-elshafey (Salma Elshafey), co-authored by Salma Elshafey <selshafey@microsoft.com>
Tool Call Accuracy V2 (#41740)
* support 5 levels, evaluate all tools at once
* Update sample notebook and change log
* Add missing import
* Modify test cases to match the new output format
* Modify other test file to match the new output format
* Fixed parsing of results
* Change key name in output
* Spell check fixes
* Minor prompt update
* Update result key to tool_call_accuracy
* Delete test_new_evaluator.ipynb
* Added field names and messages as constants
* Additional note in prompt
* Re-add the temperature to the prompty file
* Removed 'applicable' field and print statement
* Move excess/missing tool calls fields under additional details
* Typo fix and removal of redundant field in the prompt
* Modify per_tool_call_details field's name to details
* Revert "Modify per_tool_call_details field's name to details" (this reverts commit 2c3ce50)
* Revert "Merge branch 'main' into selshafey/improve_tool_call_accuracy"
* Black reformat
* Reformat with black
* To re-trigger build pipelines

Co-authored-by: Salma Elshafey <selshafey@microsoft.com>
1 parent 007b9c9 commit bf66ea7

File tree

8 files changed (+522, -451 lines)


sdk/evaluation/azure-ai-evaluation/CHANGELOG.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -25,6 +25,8 @@
 ### Bugs Fixed
 
 - Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.
+
+- Fixes and improvements to ToolCallAccuracy evaluator. New version has less variance and now works on all tool calls that happen in a turn at once. Previously, it worked on each tool call independently, without context on the other tool calls in the same turn, and then aggregated the results to a score in the range [0-1]. The score range is now [1-5].
 - Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
 - Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
```
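The changelog entry above is easier to grasp with a concrete call. Below is a minimal usage sketch of the reworked evaluator, assuming an Azure OpenAI judge model; the endpoint, key, deployment, and the weather-tool payloads are illustrative placeholders, not anything defined in this commit:

```python
# Minimal usage sketch (illustrative placeholders, not part of this commit).
from azure.ai.evaluation import AzureOpenAIModelConfiguration, ToolCallAccuracyEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<your-api-key>",                                   # placeholder
    azure_deployment="<judge-model-deployment>",                # placeholder
)

evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

# All tool calls made in the turn are now judged together, in one shot.
result = evaluator(
    query="What is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches weather information for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name."}
                },
            },
        }
    ],
)

print(result["tool_call_accuracy"])  # score on the new [1-5] scale
```

Per the commit notes, the excess/missing tool call fields are surfaced under the additional details of the result rather than as top-level keys.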

sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md

Lines changed: 0 additions & 3 deletions
```diff
@@ -46,9 +46,6 @@ This guide walks you through how to investigate failures, common errors in the `
 - Risk and safety evaluators depend on the Azure AI Studio safety evaluation backend service. For a list of supported regions, please refer to the documentation [here](https://aka.ms/azureaisafetyeval-regionsupport).
 - If you encounter a 403 Unauthorized error when using safety evaluators, verify that you have the `Contributor` role assigned to your Azure AI project. `Contributor` role is currently required to run safety evaluations.
 
-### Troubleshoot Quality Evaluator Issues
-- For `ToolCallAccuracyEvaluator`, if your input did not have a tool to evaluate, the current behavior is to output `null`.
-
 ## Handle Simulation Errors
 
 ### Adversarial Simulation Supported Regions
```

sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_common/_base_eval.py

Lines changed: 30 additions & 6 deletions
```diff
@@ -4,14 +4,34 @@
 
 import inspect
 from abc import ABC, abstractmethod
-from typing import Any, Callable, Dict, Generic, List, TypedDict, TypeVar, Union, cast, final, Optional
+from typing import (
+    Any,
+    Callable,
+    Dict,
+    Generic,
+    List,
+    TypedDict,
+    TypeVar,
+    Union,
+    cast,
+    final,
+    Optional,
+)
 
 from azure.ai.evaluation._legacy._adapters.utils import async_run_allowing_running_loop
 from typing_extensions import ParamSpec, TypeAlias, get_overloads
 
-from azure.ai.evaluation._exceptions import ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
+from azure.ai.evaluation._exceptions import (
+    ErrorBlame,
+    ErrorCategory,
+    ErrorTarget,
+    EvaluationException,
+)
 from azure.ai.evaluation._common.utils import remove_optional_singletons
-from azure.ai.evaluation._constants import _AggregationType, EVALUATION_PASS_FAIL_MAPPING
+from azure.ai.evaluation._constants import (
+    _AggregationType,
+    EVALUATION_PASS_FAIL_MAPPING,
+)
 from azure.ai.evaluation._model_configurations import Conversation
 from azure.ai.evaluation._common._experimental import experimental
```
1737

```diff
@@ -176,7 +196,9 @@ def _derive_singleton_inputs(self) -> List[str]:
         singletons.extend([p for p in params if p != "self"])
         return singletons
 
-    def _derive_conversation_converter(self) -> Callable[[Dict], List[DerivedEvalInput]]:
+    def _derive_conversation_converter(
+        self,
+    ) -> Callable[[Dict], List[DerivedEvalInput]]:
         """Produce the function that will be used to convert conversations to a list of evaluable inputs.
         This uses the inputs derived from the _derive_singleton_inputs function to determine which
         aspects of a conversation ought to be extracted.
```
```diff
@@ -235,7 +257,9 @@ def converter(conversation: Dict) -> List[DerivedEvalInput]:
 
         return converter
 
-    def _derive_multi_modal_conversation_converter(self) -> Callable[[Dict], List[Dict[str, Any]]]:
+    def _derive_multi_modal_conversation_converter(
+        self,
+    ) -> Callable[[Dict], List[Dict[str, Any]]]:
         """Produce the function that will be used to convert multi-modal conversations to a list of evaluable inputs.
         This uses the inputs derived from the _derive_singleton_inputs function to determine which
         aspects of a conversation ought to be extracted.
```
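The two signature reflows above belong to converter factories whose docstrings describe turning a conversation dict into a list of per-turn evaluable inputs. As a rough, self-contained illustration of that pattern (the helper below is hypothetical and far simpler than the SDK's real converter, which also handles context and multi-modal content):

```python
# Illustrative converter-factory sketch; not the SDK's implementation.
from typing import Any, Callable, Dict, List

def make_converter(singleton_inputs: List[str]) -> Callable[[Dict], List[Dict[str, Any]]]:
    """Build a function that maps a conversation onto per-turn evaluator inputs."""

    def converter(conversation: Dict) -> List[Dict[str, Any]]:
        messages = conversation.get("messages", [])
        eval_inputs: List[Dict[str, Any]] = []
        # Pair each user message with the assistant reply that follows it.
        for user_msg, assistant_msg in zip(messages[::2], messages[1::2]):
            candidate = {
                "query": user_msg.get("content"),
                "response": assistant_msg.get("content"),
            }
            # Keep only the fields the evaluator declared as call inputs.
            eval_inputs.append({k: v for k, v in candidate.items() if k in singleton_inputs})
        return eval_inputs

    return converter

# An evaluator whose __call__ accepts (query, response) would derive the
# singleton inputs ["query", "response"] and get one eval input per turn.
convert = make_converter(["query", "response"])
```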
```diff
@@ -288,7 +312,7 @@ def multi_modal_converter(conversation: Dict) -> List[Dict[str, Any]]:
 
         return multi_modal_converter
 
-    def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput]]:
+    def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput], Dict[str, Any]]:
         """Convert an arbitrary input into a list of inputs for evaluators.
         It is assumed that evaluators generally make use of their inputs in one of two ways.
         Either they receive a collection of keyname inputs that are all single values
```
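The one functional change in this hunk is the widened return annotation: `_convert_kwargs_to_eval_input` may now return a single dict, which is what lets the ToolCallAccuracy evaluator bundle every tool call of a turn into one evaluation instead of a list of independent ones. Here is a hedged sketch of how a caller can branch on that union; `run_one` and `dispatch` are hypothetical stand-ins, not SDK code:

```python
# Sketch of dispatching on the widened Union return type; hypothetical helpers.
from typing import Any, Dict, List, Union

EvalInput = Dict[str, Any]

def run_one(item: EvalInput) -> Dict[str, Any]:
    # Stand-in for a real model-graded evaluation call.
    return {"inputs": item, "score": None}

def dispatch(eval_input: Union[List[EvalInput], EvalInput]) -> List[Dict[str, Any]]:
    # A single dict means "evaluate everything at once" (e.g. all tool calls
    # in a turn together); a list means one evaluation per derived input.
    if isinstance(eval_input, dict):
        return [run_one(eval_input)]
    return [run_one(item) for item in eval_input]
```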
