Tool Call Accuracy V2 #41740


Status: Open. Wants to merge 25 commits from selshafey/improve_tool_call_accuracy into main.
Commits (25):
- 41de91a: support 5 levels, evaluate all tools at once (Jun 22, 2025)
- 6a1e2b3: Update sample notebook and change log (Jun 23, 2025)
- 0dad199: Add missing import (Jun 23, 2025)
- e4b1a37: Modify test cases to match the new output format (Jun 23, 2025)
- a40c91b: Modify other test file to match the new output format (Jun 23, 2025)
- ed0ecf9: Fixed parsing of results (Jun 24, 2025)
- 9bc900b: Change key name in output (Jun 24, 2025)
- eaf493a: Spell check fixes (Jun 24, 2025)
- 1965639: Minor prompt update (Jun 24, 2025)
- 8865240: Update result key to tool_call_accuracy (Jun 25, 2025)
- fcd1cb8: Delete test_new_evaluator.ipynb (salma-elshafey, Jun 25, 2025)
- 67fc87d: Added field names and messages as constants (Jun 25, 2025)
- 080f941: Merge branch 'selshafey/improve_tool_call_accuracy' of https://github… (Jun 25, 2025)
- fd2429f: Additional note in prompt (Jun 29, 2025)
- 6c9e342: Re-add the temperature to the prompty file (Jun 30, 2025)
- d0f637e: Removed 'applicable' field and print statement (Jun 30, 2025)
- 4c27dff: Move excess/missing tool calls fields under additional details (Jul 1, 2025)
- 3fa14f0: Typo fix and removal of redundant field in the prompt (Jul 2, 2025)
- 2c3ce50: Modify per_tool_call_details field's name to details (Jul 7, 2025)
- 6525a6f: Revert "Modify per_tool_call_details field's name to details" (Jul 16, 2025)
- e72b084: Revert 'Merge branch 'main' into selshafey/improve_tool_call_accuracy' (Jul 16, 2025)
- 3d4f2cc: Merge branch 'main' into selshafey/improve_tool_call_accuracy (Jul 16, 2025)
- a79b3a1: Black reformat (Jul 16, 2025)
- 440b6c1: Reformat with black (Jul 16, 2025)
- e690217: To re-trigger build pipelines (Jul 17, 2025)
2 changes: 2 additions & 0 deletions sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -25,6 +25,8 @@
### Bugs Fixed

- Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.

- Fixes and improvements to ToolCallAccuracy evaluator. The new version has less variance and now evaluates all tool calls that happen in a turn at once. Previously, it evaluated each tool call independently, without context on the other tool calls in the same turn, and aggregated the results to a score in the range [0-1]. The score range is now [1-5].
- Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
- Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
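
As context for the ToolCallAccuracy change above, a minimal usage sketch follows. It assumes the public `ToolCallAccuracyEvaluator` entry point in `azure.ai.evaluation`; the endpoint values and tool payload shapes are placeholders, and the `tool_call_accuracy` result key is taken from the commit history above rather than from this diff.

```python
# Hypothetical usage sketch of the reworked ToolCallAccuracyEvaluator.
# Endpoint values and tool payloads below are placeholders, not real config.
from azure.ai.evaluation import ToolCallAccuracyEvaluator

model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

# All tool calls from a single turn are now evaluated together.
result = evaluator(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches the weather for the given location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
            },
        }
    ],
)

# Score is now reported on the new 1-5 scale (result key per commit history).
print(result["tool_call_accuracy"])
```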

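The MeteorScore threshold fix above comes down to integer truncation before comparison; a minimal reproduction, with 0.5 as an assumed example threshold:

```python
score = 0.9375
threshold = 0.5

# Buggy behavior: int() truncates the decimal score to 0 before comparing,
# so a score above the threshold is reported as failing.
buggy_pass = int(score) >= threshold   # False

# Fixed behavior: compare the float score directly against the threshold.
fixed_pass = score >= threshold        # True

print(buggy_pass, fixed_pass)
```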
3 changes: 0 additions & 3 deletions sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
@@ -46,9 +46,6 @@ This guide walks you through how to investigate failures, common errors in the `
- Risk and safety evaluators depend on the Azure AI Studio safety evaluation backend service. For a list of supported regions, please refer to the documentation [here](https://aka.ms/azureaisafetyeval-regionsupport).
- If you encounter a 403 Unauthorized error when using safety evaluators, verify that you have the `Contributor` role assigned to your Azure AI project. `Contributor` role is currently required to run safety evaluations.

### Troubleshoot Quality Evaluator Issues
- For `ToolCallAccuracyEvaluator`, if your input did not have a tool to evaluate, the current behavior is to output `null`.

## Handle Simulation Errors

### Adversarial Simulation Supported Regions
@@ -4,14 +4,34 @@

import inspect
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Generic, List, TypedDict, TypeVar, Union, cast, final, Optional
from typing import (
Any,
Callable,
Dict,
Generic,
List,
TypedDict,
TypeVar,
Union,
cast,
final,
Optional,
)

from azure.ai.evaluation._legacy._adapters.utils import async_run_allowing_running_loop
from typing_extensions import ParamSpec, TypeAlias, get_overloads

from azure.ai.evaluation._exceptions import ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
from azure.ai.evaluation._exceptions import (
ErrorBlame,
ErrorCategory,
ErrorTarget,
EvaluationException,
)
from azure.ai.evaluation._common.utils import remove_optional_singletons
from azure.ai.evaluation._constants import _AggregationType, EVALUATION_PASS_FAIL_MAPPING
from azure.ai.evaluation._constants import (
_AggregationType,
EVALUATION_PASS_FAIL_MAPPING,
)
from azure.ai.evaluation._model_configurations import Conversation
from azure.ai.evaluation._common._experimental import experimental

@@ -176,7 +196,9 @@ def _derive_singleton_inputs(self) -> List[str]:
singletons.extend([p for p in params if p != "self"])
return singletons

def _derive_conversation_converter(self) -> Callable[[Dict], List[DerivedEvalInput]]:
def _derive_conversation_converter(
self,
) -> Callable[[Dict], List[DerivedEvalInput]]:
"""Produce the function that will be used to convert conversations to a list of evaluable inputs.
This uses the inputs derived from the _derive_singleton_inputs function to determine which
aspects of a conversation ought to be extracted.
@@ -235,7 +257,9 @@ def converter(conversation: Dict) -> List[DerivedEvalInput]:

return converter

def _derive_multi_modal_conversation_converter(self) -> Callable[[Dict], List[Dict[str, Any]]]:
def _derive_multi_modal_conversation_converter(
self,
) -> Callable[[Dict], List[Dict[str, Any]]]:
"""Produce the function that will be used to convert multi-modal conversations to a list of evaluable inputs.
This uses the inputs derived from the _derive_singleton_inputs function to determine which
aspects of a conversation ought to be extracted.
@@ -288,7 +312,7 @@ def multi_modal_converter(conversation: Dict) -> List[Dict[str, Any]]:

return multi_modal_converter

def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput]]:
def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput], Dict[str, Any]]:
"""Convert an arbitrary input into a list of inputs for evaluators.
It is assumed that evaluators generally make use of their inputs in one of two ways.
Either they receive a collection of keyname inputs that are all single values
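
To illustrate the two input shapes that the `_convert_kwargs_to_eval_input` docstring distinguishes, here is a small sketch; the field names and conversation structure are assumptions chosen for illustration, based on the `Conversation` type imported above.

```python
from typing import Any, Dict

# Shape 1: singleton keyword inputs, each a single value.
singleton_kwargs: Dict[str, Any] = {
    "query": "What is the capital of France?",
    "response": "Paris.",
}

# Shape 2: a single "conversation" input holding the full message list.
# The converter produced by _derive_conversation_converter() expands this
# into one evaluable input per assistant turn.
conversation_kwargs: Dict[str, Any] = {
    "conversation": {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris."},
        ]
    }
}

# _convert_kwargs_to_eval_input(**kwargs) normalizes either shape; with this
# change its return type also admits a single dict of inputs.
```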