sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty
Lines changed: 10 additions & 9 deletions
@@ -4,8 +4,7 @@ description: Evaluates Tool Call Accuracy for tool used by agent
 model:
   api: chat
   parameters:
-    temperature: 0
-    max_tokens: 3000
+    max_completion_tokens: 3000
     top_p: 1.0
     presence_penalty: 0
     frequency_penalty: 0
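The hunk above replaces `max_tokens` with `max_completion_tokens` and drops `temperature`. The diff does not state the motivation; one common reason, assumed here, is compatibility with models that reject the older parameter names. A minimal Python sketch of the same transformation applied to a plain dictionary follows. `migrate_parameters` is a hypothetical helper for illustration and is not part of the SDK.

```python
from typing import Any, Dict


def migrate_parameters(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with `temperature` dropped and `max_tokens` renamed.

    Hypothetical helper: it mirrors the edit made in the diff and
    leaves the input dictionary untouched.
    """
    # Copy every key except `temperature`, which the diff removes.
    migrated = {key: value for key, value in params.items() if key != "temperature"}
    # Rename `max_tokens`, keeping any `max_completion_tokens` already present.
    if "max_tokens" in migrated:
        value = migrated.pop("max_tokens")
        migrated.setdefault("max_completion_tokens", value)
    return migrated


old_parameters = {
    "temperature": 0,
    "max_tokens": 3000,
    "top_p": 1.0,
    "presence_penalty": 0,
    "frequency_penalty": 0,
}
new_parameters = migrate_parameters(old_parameters)
print(new_parameters)
```

The helper returns a new dictionary rather than mutating its argument, so the original configuration stays available for comparison.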
@@ -40,17 +39,18 @@ user:
 4. Potential Value: Is the information this tool call might provide likely to be useful in advancing the conversation or addressing the user expressed or implied needs?
 5. Context Appropriateness: Does the tool call make sense at this point in the conversation, given what has been discussed so far?

+
 # Ratings
 ## [Tool Call Accuracy: 1] (Irrelevant)
 **Definition:**
-Tool calls were not relevant to the user's query, resulting in an irrelevant or unhelpful final output.
+Tool calls were not relevant to the user's query, resulting in an irrelevant or unhelpful final output.
 This level is a 'fail'.

 **Example:**
 The user's query is asking for most popular hotels in New York, but the agent calls a tool that does search in local files on a machine. This tool is not relevant to the user query, so this case is a Level 1 'fail'.


-## [Tool Call Accuracy: 2] (Partially Relevant - No output)
[...]
 Tool calls were somewhat related to the user's query, but the agent was not able to reach a final output that addresses the user query due to one or more of the following:
 • Tools returned errors, and no retrials for the tool call were successful.
[...]
-Tool calls were relevant and led to a correct output. However, multiple excessive, unnecessary tool calls were made.
+Tool calls were relevant, correct and grounded parameters were passed so that led to a correct output. However, multiple excessive, unnecessary tool calls were made.
[...]
-• Correct tools were called with the correct parameters, whether they are extracted from the conversation history or the current user query.
+• Correct tools were called with the correct and grounded parameters, whether they are extracted from the conversation history or the current user query.
 • A tool returned an error, but the agent retried calling the tool and successfully got an output.
[...]
-• Correct tools were called with the correct parameters, whether they are extracted from the conversation history or the current user query.
+• Correct tools were called with the correct and grounded parameters, whether they are extracted from the conversation history or the current user query.
 • No unnecessary or excessive tool calls were made.
 • No errors occurred in any of the tools.
 • The agent was able to reach the final output that addresses the user's query without facing any issues.
@@ -110,7 +110,8 @@ This level is a 'pass'.

 # IMPORTANT NOTES
 - There is a clear distinction between 'pass' levels and 'fail' levels. The distinction is that the tools are called correctly in order to reach the required output. If the agent was not able to reach the final output that addresses the user query, it cannot be either of the 'pass' levels, and vice versa. It is crucial that you ensure you are rating the agent's response with the correct level based on the tool calls made to address the user's query.
-- You are NOT concerned with the correctness of the result of the tool. As long as the tool did not return an error, then the tool output is correct and accurate. Do not look into the correctness of the tool's result.
+- "Correct output" means correct tool with the correct, grounded parameters. You are NOT concerned with the correctness of the result of the tool. As long as the parameters passed were correct and the tool did not return an error, then the tool output is correct and accurate.
+- Ensure that every single parameter that is passed to the tools is correct and grounded from the user query or the conversation history. If the agent passes incorrect parameters or completely makes them up, then this is a fail, even if somehow the agent reaches a correct result.
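The added notes require every tool-call parameter to be grounded in the user query or the conversation history. In the prompt, that judgment is left to the evaluating model. As a rough illustration only, the sketch below approximates grounding with a case-insensitive substring check. `ungrounded_parameters` and its inputs are hypothetical and not part of the evaluator, and a real grounding check would need to handle paraphrases, units, and derived values.

```python
from typing import Any, Dict, List


def ungrounded_parameters(arguments: Dict[str, Any], conversation: List[str]) -> List[str]:
    """List argument names whose values never appear in the conversation.

    Hypothetical heuristic: a value counts as grounded when its string
    form occurs, ignoring case, somewhere in the conversation text.
    """
    haystack = " ".join(conversation).lower()
    return [
        name
        for name, value in arguments.items()
        if str(value).lower() not in haystack
    ]


conversation = ["Find popular hotels in New York for 2 guests"]
grounded_call = {"city": "New York", "guests": 2}
invented_call = {"city": "Boston", "guests": 2}

print(ungrounded_parameters(grounded_call, conversation))
print(ungrounded_parameters(invented_call, conversation))
```

Under the rule in the notes, a call with any ungrounded parameter would be rated a fail even if the agent's final answer happened to be correct.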
[...]
 Your output should consist only of a JSON object, as provided in the examples, that has the following keys:
 - chain_of_thought: a string that explains your thought process to decide on the tool call accuracy level. Start this string with 'Let's think step by step:', and think deeply and precisely about which level should be chosen based on the agent's tool calls and how they were able to address the user's query.
 - tool_calls_success_level: a integer value between 1 and 5 that represents the level of tool call success, based on the level definitions mentioned before. You need to be very precise when deciding on this level. Ensure you are correctly following the rating system based on the description of each level.
-- tool_calls_success_result: 'pass' or 'fail' based on the evaluation level of the tool call accuracy. Levels 1 and 2 are a 'fail', levels 3, 4 and 5 are a 'pass'.
+- tool_calls_sucess_result: 'pass' or 'fail' based on the evaluation level of the tool call accuracy. Levels 1 and 2 are a 'fail', levels 3, 4 and 5 are a 'pass'.
 - additional_details: a dictionary that contains the following keys:
 - tool_calls_made_by_agent: total number of tool calls made by the agent
 - correct_tool_calls_made_by_agent: total number of correct tool calls made by the agent
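The output contract above can be checked mechanically. The following sketch, under stated assumptions, parses a judge response and verifies that the level is an integer from 1 to 5 and that the pass/fail field agrees with it (levels 1 and 2 fail, levels 3 to 5 pass). `validate_judge_output` and `expected_result` are hypothetical helpers, not SDK functions. The sketch accepts both spellings of the result key that appear in this diff, because the diff does not say which one downstream code reads.

```python
import json
from typing import Any, Dict

# Both spellings appear in the diff; accepting either is an assumption.
RESULT_KEYS = ("tool_calls_sucess_result", "tool_calls_success_result")


def expected_result(level: int) -> str:
    """Map a 1-5 level to 'pass' or 'fail' as described in the prompt."""
    if not 1 <= level <= 5:
        raise ValueError(f"level out of range: {level}")
    return "pass" if level >= 3 else "fail"


def validate_judge_output(raw: str) -> Dict[str, Any]:
    """Parse the judge's JSON and check level/result consistency."""
    data = json.loads(raw)
    level = data.get("tool_calls_success_level")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(level, bool) or not isinstance(level, int):
        raise ValueError("tool_calls_success_level must be an integer")
    result = next((data[key] for key in RESULT_KEYS if key in data), None)
    if result != expected_result(level):
        raise ValueError(f"result {result!r} does not match level {level}")
    return data


sample = json.dumps(
    {
        "chain_of_thought": "Let's think step by step: the tool matched the query.",
        "tool_calls_success_level": 4,
        "tool_calls_sucess_result": "pass",
        "additional_details": {
            "tool_calls_made_by_agent": 2,
            "correct_tool_calls_made_by_agent": 2,
        },
    }
)
checked = validate_judge_output(sample)
print(checked["tool_calls_success_level"])
```

A response whose level and result disagree, such as level 2 with 'pass', raises `ValueError` instead of being silently accepted.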