Tool Call Accuracy V2 #41740


Status: Open. Wants to merge 25 commits from selshafey/improve_tool_call_accuracy into main.
Commits (25):
- 41de91a: support 5 levels, evaluate all tools at once (Jun 22, 2025)
- 6a1e2b3: Update sample notebook and change log (Jun 23, 2025)
- 0dad199: Add missing import (Jun 23, 2025)
- e4b1a37: Modify test cases to match the new output format (Jun 23, 2025)
- a40c91b: Modify other test file to match the new output format (Jun 23, 2025)
- ed0ecf9: Fixed parsing of results (Jun 24, 2025)
- 9bc900b: Change key name in output (Jun 24, 2025)
- eaf493a: Spell check fixes (Jun 24, 2025)
- 1965639: Minor prompt update (Jun 24, 2025)
- 8865240: Update result key to tool_call_accuracy (Jun 25, 2025)
- fcd1cb8: Delete test_new_evaluator.ipynb (salma-elshafey, Jun 25, 2025)
- 67fc87d: Added field names and messages as constants (Jun 25, 2025)
- 080f941: Merge branch 'selshafey/improve_tool_call_accuracy' of https://github… (Jun 25, 2025)
- fd2429f: Additional note in prompt (Jun 29, 2025)
- 6c9e342: Re-add the temperature to the prompty file (Jun 30, 2025)
- d0f637e: Removed 'applicable' field and print statement (Jun 30, 2025)
- 4c27dff: Move excess/missing tool calls fields under additional details (Jul 1, 2025)
- 3fa14f0: Typo fix and removal of redundant field in the prompt (Jul 2, 2025)
- 2c3ce50: Modify per_tool_call_details field's name to details (Jul 7, 2025)
- 6525a6f: Revert "Modify per_tool_call_details field's name to details" (Jul 16, 2025)
- e72b084: Revert 'Merge branch 'main' into selshafey/improve_tool_call_accuracy' (Jul 16, 2025)
- 3d4f2cc: Merge branch 'main' into selshafey/improve_tool_call_accuracy (Jul 16, 2025)
- a79b3a1: Black reformat (Jul 16, 2025)
- 440b6c1: Reformat with black (Jul 16, 2025)
- e690217: To re-trigger build pipelines (Jul 17, 2025)
2 changes: 2 additions & 0 deletions sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -25,6 +25,8 @@
### Bugs Fixed

- Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.

- Fixes and improvements to ToolCallAccuracy evaluator. The new version has less variance and now evaluates all tool calls that happen in a turn at once. Previously, it evaluated each tool call independently, without context on the other tool calls in the same turn, and aggregated the results to a score in the range [0-1]. The score range is now [1-5].
- Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)
- Added a new enum `ADVERSARIAL_QA_DOCUMENTS` which moves all the "file_content" type prompts away from `ADVERSARIAL_QA` to the new enum
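
As context for the ToolCallAccuracy change above, a minimal usage sketch follows. It assumes the public `ToolCallAccuracyEvaluator` entry point in `azure.ai.evaluation`; the endpoint values and tool payload shapes are placeholders, and the `tool_call_accuracy` result key is taken from the commit history above rather than from this diff.

```python
# Hypothetical usage sketch of the reworked ToolCallAccuracyEvaluator.
# Endpoint values and tool payloads below are placeholders, not real config.
from azure.ai.evaluation import ToolCallAccuracyEvaluator

model_config = {
    "azure_endpoint": "https://<your-endpoint>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

evaluator = ToolCallAccuracyEvaluator(model_config=model_config)

# All tool calls from a single turn are now evaluated together.
result = evaluator(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches the weather for the given location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
            },
        }
    ],
)

# Score is now reported on the new 1-5 scale (result key per commit history).
print(result["tool_call_accuracy"])
```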

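The MeteorScore threshold fix above comes down to integer truncation before comparison; a minimal reproduction, with 0.5 as an assumed example threshold:

```python
score = 0.9375
threshold = 0.5

# Buggy behavior: int() truncates the decimal score to 0 before comparing,
# so a score above the threshold is reported as failing.
buggy_pass = int(score) >= threshold   # False

# Fixed behavior: compare the float score directly against the threshold.
fixed_pass = score >= threshold        # True

print(buggy_pass, fixed_pass)
```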
3 changes: 0 additions & 3 deletions sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
@@ -46,9 +46,6 @@ This guide walks you through how to investigate failures, common errors in the `
- Risk and safety evaluators depend on the Azure AI Studio safety evaluation backend service. For a list of supported regions, please refer to the documentation [here](https://aka.ms/azureaisafetyeval-regionsupport).
- If you encounter a 403 Unauthorized error when using safety evaluators, verify that you have the `Contributor` role assigned to your Azure AI project. `Contributor` role is currently required to run safety evaluations.

### Troubleshoot Quality Evaluator Issues
- For `ToolCallAccuracyEvaluator`, if your input did not have a tool to evaluate, the current behavior is to output `null`.

## Handle Simulation Errors

### Adversarial Simulation Supported Regions
@@ -4,14 +4,34 @@

import inspect
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Generic, List, TypedDict, TypeVar, Union, cast, final, Optional
from typing import (
Any,
Callable,
Dict,
Generic,
List,
TypedDict,
TypeVar,
Union,
cast,
final,
Optional,
)

from azure.ai.evaluation._legacy._adapters.utils import async_run_allowing_running_loop
from typing_extensions import ParamSpec, TypeAlias, get_overloads

from azure.ai.evaluation._exceptions import ErrorBlame, ErrorCategory, ErrorTarget, EvaluationException
from azure.ai.evaluation._exceptions import (
ErrorBlame,
ErrorCategory,
ErrorTarget,
EvaluationException,
)
from azure.ai.evaluation._common.utils import remove_optional_singletons
from azure.ai.evaluation._constants import _AggregationType, EVALUATION_PASS_FAIL_MAPPING
from azure.ai.evaluation._constants import (
_AggregationType,
EVALUATION_PASS_FAIL_MAPPING,
)
from azure.ai.evaluation._model_configurations import Conversation
from azure.ai.evaluation._common._experimental import experimental

@@ -176,7 +196,9 @@ def _derive_singleton_inputs(self) -> List[str]:
singletons.extend([p for p in params if p != "self"])
return singletons

def _derive_conversation_converter(self) -> Callable[[Dict], List[DerivedEvalInput]]:
def _derive_conversation_converter(
self,
) -> Callable[[Dict], List[DerivedEvalInput]]:
"""Produce the function that will be used to convert conversations to a list of evaluable inputs.
This uses the inputs derived from the _derive_singleton_inputs function to determine which
aspects of a conversation ought to be extracted.
@@ -235,7 +257,9 @@ def converter(conversation: Dict) -> List[DerivedEvalInput]:

return converter

def _derive_multi_modal_conversation_converter(self) -> Callable[[Dict], List[Dict[str, Any]]]:
def _derive_multi_modal_conversation_converter(
self,
) -> Callable[[Dict], List[Dict[str, Any]]]:
"""Produce the function that will be used to convert multi-modal conversations to a list of evaluable inputs.
This uses the inputs derived from the _derive_singleton_inputs function to determine which
aspects of a conversation ought to be extracted.
@@ -288,7 +312,7 @@ def multi_modal_converter(conversation: Dict) -> List[Dict[str, Any]]:

return multi_modal_converter

def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput]]:
def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput], Dict[str, Any]]:
"""Convert an arbitrary input into a list of inputs for evaluators.
It is assumed that evaluators generally make use of their inputs in one of two ways.
Either they receive a collection of keyname inputs that are all single values
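
To illustrate the two input shapes that the `_convert_kwargs_to_eval_input` docstring distinguishes, here is a small sketch; the field names and conversation structure are assumptions chosen for illustration, based on the `Conversation` type imported above.

```python
from typing import Any, Dict

# Shape 1: singleton keyword inputs, each a single value.
singleton_kwargs: Dict[str, Any] = {
    "query": "What is the capital of France?",
    "response": "Paris.",
}

# Shape 2: a single "conversation" input holding the full message list.
# The converter produced by _derive_conversation_converter() expands this
# into one evaluable input per assistant turn.
conversation_kwargs: Dict[str, Any] = {
    "conversation": {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris."},
        ]
    }
}

# _convert_kwargs_to_eval_input(**kwargs) normalizes either shape; with this
# change its return type also admits a single dict of inputs.
```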