Tool Call Accuracy V2 #41740


Merged
Changes from 11 commits

Commits (25)
- 41de91a: support 5 levels, evaluate all tools at once (Jun 22, 2025)
- 6a1e2b3: Update sample notebook and change log (Jun 23, 2025)
- 0dad199: Add missing import (Jun 23, 2025)
- e4b1a37: Modify test cases to match the new output format (Jun 23, 2025)
- a40c91b: Modify other test file to match the new output format (Jun 23, 2025)
- ed0ecf9: Fixed parsing of results (Jun 24, 2025)
- 9bc900b: Change key name in output (Jun 24, 2025)
- eaf493a: Spell check fixes (Jun 24, 2025)
- 1965639: Minor prompt update (Jun 24, 2025)
- 8865240: Update result key to tool_call_accuracy (Jun 25, 2025)
- fcd1cb8: Delete test_new_evaluator.ipynb (salma-elshafey, Jun 25, 2025)
- 67fc87d: Added field names and messages as constants (Jun 25, 2025)
- 080f941: Merge branch 'selshafey/improve_tool_call_accuracy' of https://github… (Jun 25, 2025)
- fd2429f: Additional note in prompt (Jun 29, 2025)
- 6c9e342: Re-add the temperature to the prompty file (Jun 30, 2025)
- d0f637e: Removed 'applicable' field and print statement (Jun 30, 2025)
- 4c27dff: Move excess/missing tool calls fields under additional details (Jul 1, 2025)
- 3fa14f0: Typo fix and removal of redundant field in the prompt (Jul 2, 2025)
- 2c3ce50: Modify per_tool_call_details field's name to details (Jul 7, 2025)
- 6525a6f: Revert "Modify per_tool_call_details field's name to details" (Jul 16, 2025)
- e72b084: Revert 'Merge branch 'main' into selshafey/improve_tool_call_accuracy' (Jul 16, 2025)
- 3d4f2cc: Merge branch 'main' into selshafey/improve_tool_call_accuracy (Jul 16, 2025)
- a79b3a1: Black reformat (Jul 16, 2025)
- 440b6c1: Reformat with black (Jul 16, 2025)
- e690217: To re-trigger build pipelines (Jul 17, 2025)
2 changes: 2 additions & 0 deletions sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -8,6 +8,8 @@

- Significant improvements to IntentResolution evaluator. New version has less variance, is nearly 2x faster and consumes fewer tokens.

- Fixes and improvements to ToolCallAccuracy evaluator. The new version has less variance and now evaluates all tool calls that happen in a turn at once; previously, it evaluated each tool call independently, without context on the other tool calls in the same turn, and aggregated the results to a number in the range [0-1]. The score range is now [1-5] (see the usage sketch after this diff).

- Fixed MeteorScoreEvaluator and other threshold-based evaluators returning incorrect binary results due to integer conversion of decimal scores. Previously, decimal scores like 0.9375 were incorrectly converted to integers (0) before threshold comparison, causing them to fail even when above the threshold. [#41415](https://github.com/Azure/azure-sdk-for-python/issues/41415)

## 1.8.0 (2025-05-29)
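For context, a minimal usage sketch of the reworked evaluator, assuming the public `ToolCallAccuracyEvaluator` API from `azure.ai.evaluation`; the endpoint, deployment, tool call, and tool definition below are illustrative placeholders, not taken from this PR:

```python
# Sketch only: assumes the azure-ai-evaluation public API; all values are placeholders.
from azure.ai.evaluation import AzureOpenAIModelConfiguration, ToolCallAccuracyEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",  # placeholder
    azure_deployment="<deployment-name>",  # placeholder
)

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

# All tool calls made in the turn are now evaluated together, not one at a time.
result = tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_1",
            "name": "fetch_weather",
            "arguments": {"location": "Seattle"},
        }
    ],
    tool_definitions=[
        {
            "name": "fetch_weather",
            "description": "Fetches the weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to fetch weather for.",
                    }
                },
            },
        }
    ],
)

# Per this PR, the score is reported under the `tool_call_accuracy` key on a [1-5] scale.
print(result["tool_call_accuracy"])
```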
3 changes: 0 additions & 3 deletions sdk/evaluation/azure-ai-evaluation/TROUBLESHOOTING.md
@@ -46,9 +46,6 @@ This guide walks you through how to investigate failures, common errors in the `
- Risk and safety evaluators depend on the Azure AI Studio safety evaluation backend service. For a list of supported regions, please refer to the documentation [here](https://aka.ms/azureaisafetyeval-regionsupport).
- If you encounter a 403 Unauthorized error when using safety evaluators, verify that you have the `Contributor` role assigned to your Azure AI project. `Contributor` role is currently required to run safety evaluations.

-### Troubleshoot Quality Evaluator Issues
-- For `ToolCallAccuracyEvaluator`, if your input did not have a tool to evaluate, the current behavior is to output `null`.

## Handle Simulation Errors

### Adversarial Simulation Supported Regions
@@ -288,7 +288,7 @@ def multi_modal_converter(conversation: Dict) -> List[Dict[str, Any]]:

return multi_modal_converter

-    def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput]]:
+    def _convert_kwargs_to_eval_input(self, **kwargs) -> Union[List[Dict], List[DerivedEvalInput], Dict[str, Any]]:
"""Convert an arbitrary input into a list of inputs for evaluators.
It is assumed that evaluators generally make use of their inputs in one of two ways.
Either they receive a collection of keyname inputs that are all single values
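The widened return type lets the converter hand back a single combined input for a whole turn rather than one entry per tool call, in line with evaluating all tool calls at once. A hypothetical illustration of such a combined input; the exact internal structure is not shown in this diff and is an assumption:

```python
from typing import Any, Dict

# Hypothetical combined eval input for one turn; key names mirror the prompty
# inputs (query, tool_calls, tool_definitions), but the internal shape is assumed.
eval_input: Dict[str, Any] = {
    "query": [{"role": "user", "content": "How is the weather in Seattle?"}],
    "tool_calls": [
        {"name": "fetch_weather", "arguments": {"location": "Seattle"}},
    ],
    "tool_definitions": {
        "fetch_weather": {"description": "Fetches weather for a location."},
    },
}
```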

Large diffs are not rendered by default.

@@ -4,21 +4,21 @@ description: Evaluates Tool Call Accuracy for tool used by agent
model:
  api: chat
  parameters:
-   temperature: 0.0
-   max_tokens: 800
+   temperature: 0
+   max_tokens: 3000
    top_p: 1.0
    presence_penalty: 0
    frequency_penalty: 0
    response_format:
-     type: text
+     type: json_object

inputs:
  query:
-   type: array
- tool_call:
-   type: object
- tool_definition:
-   type: object
+   type: List
+ tool_calls:
+   type: List
+ tool_definitions:
+   type: Dict

---
system:
@@ -27,7 +27,7 @@
### You are an expert in evaluating the accuracy of a tool call considering relevance and potential usefulness including syntactic and semantic correctness of a proposed tool call from an intelligent system based on provided definition and data. Your goal will involve answering the questions below using the information provided.
- **Definition**: You are given a definition of the communication trait that is being evaluated to help guide your Score.
- **Data**: Your input data include CONVERSATION, TOOL CALL and TOOL DEFINITION.
-- **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways.
+- **Tasks**: To complete your evaluation you will be asked to evaluate the Data in different ways, and you need to be very precise in your evaluation.

user:
# Definition
@@ -40,32 +40,109 @@
4. Potential Value: Is the information this tool call might provide likely to be useful in advancing the conversation or addressing the user's expressed or implied needs?
5. Context Appropriateness: Does the tool call make sense at this point in the conversation, given what has been discussed so far?


# Ratings
-## [Tool Call Accuracy: 0] (Irrelevant)
+## [Tool Call Accuracy: 1] (Irrelevant)
**Definition:**
Tool calls were not relevant to the user's query, resulting in an irrelevant or unhelpful final output.
This level is a 'fail'.

**Example:**
The user's query is asking for most popular hotels in New York, but the agent calls a tool that does search in local files on a machine. This tool is not relevant to the user query, so this case is a Level 1 'fail'.


## [Tool Call Accuracy: 2] (Partially Relevant - No output)
**Definition:**
Tool calls were somewhat related to the user's query, but the agent was not able to reach a final output that addresses the user query due to one or more of the following:
• Tools returned errors, and no retrials for the tool call were successful.
• Parameters passed to the tool were incorrect.
• Not enough tools were called to fully address the query (missing tool calls).
This level is a 'fail'.

**Example:**
The user asks for the coordinates of Chicago. The agent calls the correct tool that retrieves the coordinates (the relevant tool for the user query), but passes 'New York' instead of 'Chicago' as the parameter to the tool. So this is a Level 2 'fail'.

**Example:**
The user asks for the coordinates of Chicago. The agent calls the correct tool that retrieves the coordinates (the relevant tool for the user query) and passes 'Chicago' as the parameter, which is also correct, but the tool returns an error, so the agent can't reach the correct answer to the user's query. This is a Level 2 'fail'.

**Example:**
The user asks a question that needs 3 tool calls for it to be answered. The agent calls only one of the three required tool calls. So this case is a Level 2 'fail'.


## [Tool Call Accuracy: 3] (Slightly Correct - Reached Output)
**Definition:**
Tool calls were relevant and led to a correct output. However, multiple excessive, unnecessary tool calls were made.
This level is a 'pass'.

**Example:**
The user asked to do a modification in the database. The agent called the tool multiple times, resulting in multiple modifications in the database instead of one. This is a Level 3 'pass'.

**Example:**
The user asked for popular hotels in a certain place. The agent calls the same tool with the same parameters multiple times, even though a single tool call that returns an output is sufficient. So there were unnecessary tool calls. This is a Level 3 'pass'.


## [Tool Call Accuracy: 4] (Mostly Correct - Reached output)
**Definition:**
-1. The TOOL CALL is not relevant and will not help resolve the user's need.
-2. TOOL CALL include parameters values that are not present or inferred from CONVERSATION.
-3. TOOL CALL has parameters that is not present in TOOL DEFINITION.
+Tool calls were fully relevant and efficient:
+• Correct tools were called with the correct parameters, whether they are extracted from the conversation history or the current user query.
+• A tool returned an error, but the agent retried calling the tool and successfully got an output.
+This level is a 'pass'.

**Example:**
The user asks for the weather forecast in a certain place. The agent calls the correct tool that retrieves the weather forecast with the correct parameters, but the tool returns an error. The agent re-calls the tool once again and it returns the correct output. This is a Level 4 'pass'.


-## [Tool Call Accuracy: 1] (Relevant)
+## [Tool Call Accuracy: 5] (Optimal Solution - Reached output)
**Definition:**
-1. The TOOL CALL is directly relevant and very likely to help resolve the user's need.
-2. TOOL CALL include parameters values that are present or inferred from CONVERSATION.
-3. TOOL CALL has parameters that is present in TOOL DEFINITION.
+Tool calls were fully relevant and efficient:
+• Correct tools were called with the correct parameters, whether they are extracted from the conversation history or the current user query.
+• No unnecessary or excessive tool calls were made.
+• No errors occurred in any of the tools.
+• The agent was able to reach the final output that addresses the user's query without facing any issues.
+This level is a 'pass'.

**Example:**
The user asks for the distance between two places. The agent correctly calls the tools that retrieve the coordinates for the two places respectively, then calls the tool that calculates the distance between the two sets of coordinates, passing the correct arguments to all the tools, without calling other tools excessively or unnecessarily. This is the optimal solution for the user's query. This is a Level 5 'pass'.

**Example:**
The user asks for the distance between two places. The agent retrieves the needed coordinates from the outputs of the tool calls in the conversation history, and then correctly passes these coordinates to the tool that calculates the distance to output it to the user. This is also an optimal solution for the user's query. This is a Level 5 'pass'.



# IMPORTANT NOTES
- There is a clear distinction between 'pass' levels and 'fail' levels. The distinction is whether the tools are called correctly in order to reach the required output. If the agent was not able to reach the final output that addresses the user query, it cannot be any of the 'pass' levels, and vice versa. It is crucial that you ensure you are rating the agent's response with the correct level based on the tool calls made to address the user's query.
- You are NOT concerned with the correctness of the result of the tool. As long as the tool did not return an error, then the tool output is correct and accurate. Do not look into the correctness of the tool's result.

# Data
CONVERSATION : {{query}}
-TOOL CALL: {{tool_call}}
+TOOL CALLS: {{tool_calls}}
TOOL DEFINITION: {{tool_definition}}


# Tasks
-## Please provide your assessment Score for the previous CONVERSATION , TOOL CALL and TOOL DEFINITION based on the Definitions above. Your output should include the following information:
-- **ThoughtChain**: To improve the reasoning process, think step by step and include a step-by-step explanation of your thought process as you analyze the data based on the definitions. Keep it brief and start your ThoughtChain with "Let's think step by step:".
-- **Explanation**: a very short explanation of why you think the input Data should get that Score.
-- **Score**: based on your previous analysis, provide your Score. The Score you give MUST be a integer score (i.e., "0", "1") based on the levels of the definitions.

## Please provide your evaluation for the assistant RESPONSE in relation to the user QUERY and tool definitions based on the Definitions and examples above.
Your output should consist only of a JSON object, as provided in the examples, that has the following keys:
- chain_of_thought: a string that explains your thought process to decide on the tool call accuracy level. Start this string with 'Let's think step by step:', and think deeply and precisely about which level should be chosen based on the agent's tool calls and how they were able to address the user's query.
- tool_calls_success_level: an integer value between 1 and 5 that represents the level of tool call success, based on the level definitions mentioned before. You need to be very precise when deciding on this level. Ensure you are correctly following the rating system based on the description of each level.
- tool_calls_success_result: 'pass' or 'fail' based on the evaluation level of the tool call accuracy. Levels 1 and 2 are a 'fail', levels 3, 4 and 5 are a 'pass'.
- additional_details: a dictionary that contains the following keys:
- tool_calls_made_by_agent: total number of tool calls made by the agent
- correct_tool_calls_made_by_agent: total number of correct tool calls made by the agent
- details: a list of dictionaries, each containing:
- tool_name: name of the tool
- total_calls_required: total number of calls required for the tool
- correct_calls_made_by_agent: number of correct calls made by the agent
- correct_tool_percentage: percentage of correct calls made by the agent for this tool. It is a value between 0.0 and 1.0
- tool_call_errors: number of errors encountered during the tool call
- tool_success_result: 'pass' or 'fail' based on the evaluation of the tool call accuracy for this tool
- excess_tool_calls: a dictionary with the following keys:
- total: total number of excess, unnecessary tool calls made by the agent
- details: a list of dictionaries, each containing:
- tool_name: name of the tool
- excess_count: number of excess calls made for this query
- missing_tool_calls: a dictionary with the following keys:
- total: total number of missing tool calls that should have been made by the agent to be able to answer the query
- details: a list of dictionaries, each containing:
- tool_name: name of the tool
- missing_count: number of missing calls for this query

-## Please provide your answers between the tags: <S0>your chain of thoughts</S0>, <S1>your explanation</S1>, <S2>your Score</S2>.
# Output
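Taken together, the keys above describe an output object like the following hypothetical instance for a single successful weather lookup; the values are illustrative and not taken from the PR's tests:

```python
# Hypothetical model output matching the schema above; values are illustrative.
example_output = {
    "chain_of_thought": "Let's think step by step: the user asked for the weather, "
    "the agent called fetch_weather once with the correct location, and the call "
    "succeeded, so this is the optimal solution.",
    "tool_calls_success_level": 5,
    "tool_calls_success_result": "pass",
    "additional_details": {
        "tool_calls_made_by_agent": 1,
        "correct_tool_calls_made_by_agent": 1,
        "details": [
            {
                "tool_name": "fetch_weather",
                "total_calls_required": 1,
                "correct_calls_made_by_agent": 1,
                "correct_tool_percentage": 1.0,
                "tool_call_errors": 0,
                "tool_success_result": "pass",
            }
        ],
        "excess_tool_calls": {"total": 0, "details": []},
        "missing_tool_calls": {"total": 0, "details": []},
    },
}

print(example_output["tool_calls_success_level"])
```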
@@ -13,7 +13,7 @@
"source": [
"### Getting Started\n",
"\n",
"This sample demonstrates how to use Intent Resolution Evaluator\n",
"This sample demonstrates how to use Tool Call Accuracy Evaluator\n",
"Before running the sample:\n",
"```bash\n",
"pip install azure-ai-projects azure-identity azure-ai-evaluation\n",
@@ -39,9 +39,12 @@
"- Parameter value extraction from the conversation\n",
"- Potential usefulness of the tool call\n",
"\n",
"The evaluator uses a binary scoring system (0 or 1):\n",
" - Score 0: The tool call is irrelevant or contains information not in the conversation/definition\n",
" - Score 1: The tool call is relevant with properly extracted parameters from the conversation\n",
"The evaluator uses a scoring rubric of 1 to 5:\n",
" - Score 1: The tool calls are irrelevant\n",
" - Score 2: The tool calls are partially relevant, but not enough tools were called or the parameters were not correctly passed\n",
" - Score 3: The tool calls are relevant, but there were unncessary, excessive tool calls made\n",
" - Score 4: The tool calls are relevant, but some tools returned errors and agent retried calling them again and succeeded\n",
" - Score 5: The tool calls are relevant, and all parameters were correctly passed and no excessive calls were made.\n",
"\n",
"This evaluation focuses on measuring whether tool calls meaningfully contribute to addressing query while properly following tool definitions and using information present in the conversation history."
]