Commit fd2429f

Additional note in prompt
Author: Salma Elshafey
1 parent 080f941, commit fd2429f

1 file changed
sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_tool_call_accuracy/tool_call_accuracy.prompty

Lines changed: 10 additions & 9 deletions
@@ -4,8 +4,7 @@ description: Evaluates Tool Call Accuracy for tool used by agent
 model:
   api: chat
   parameters:
-    temperature: 0
-    max_tokens: 3000
+    max_completion_tokens: 3000
     top_p: 1.0
     presence_penalty: 0
     frequency_penalty: 0
@@ -40,17 +39,18 @@ user:
 4. Potential Value: Is the information this tool call might provide likely to be useful in advancing the conversation or addressing the user expressed or implied needs?
 5. Context Appropriateness: Does the tool call make sense at this point in the conversation, given what has been discussed so far?
 
+
 # Ratings
 ## [Tool Call Accuracy: 1] (Irrelevant)
 **Definition:**
-Tool calls were not relevant to the user's query, resulting in an irrelevant or unhelpful final output.
+Tool calls were not relevant to the user's query, resulting in anirrelevant or unhelpful final output.
 This level is a 'fail'.
 
 **Example:**
 The user's query is asking for most popular hotels in New York, but the agent calls a tool that does search in local files on a machine. This tool is not relevant to the user query, so this case is a Level 1 'fail'.
 
 
-## [Tool Call Accuracy: 2] (Partially Relevant - No output)
+## [Tool Call Accuracy: 2] (Partially Relevant - No correct output)
 **Definition:**
 Tool calls were somewhat related to the user's query, but the agent was not able to reach a final output that addresses the user query due to one or more of the following:
 • Tools returned errors, and no retrials for the tool call were successful.
@@ -70,7 +70,7 @@ This level is a 'fail'.
 
 ## [Tool Call Accuracy: 3] (Slightly Correct - Reached Output)
 **Definition:**
-Tool calls were relevant and led to a correct output. However, multiple excessive, unnecessary tool calls were made.
+Tool calls were relevant, correct and grounded parameters were passed so that led to a correct output. However, multiple excessive, unnecessary tool calls were made.
 This level is a 'pass'.
 
 **Example:**
@@ -83,7 +83,7 @@ This level is a 'pass'.
 ## [Tool Call Accuracy: 4] (Mostly Correct - Reached output)
 **Definition:**
 Tool calls were fully relevant and efficient:
-• Correct tools were called with the correct parameters, whether they are extracted from the conversation history or the current user query.
+• Correct tools were called with the correct and grounded parameters, whether they are extracted from the conversation history or the current user query.
 • A tool returned an error, but the agent retried calling the tool and successfully got an output.
 This level is a 'pass'.
 
@@ -94,7 +94,7 @@ This level is a 'pass'.
 ## [Tool Call Accuracy: 5] (Optimal Solution - Reached output)
 **Definition:**
 Tool calls were fully relevant and efficient:
-• Correct tools were called with the correct parameters, whether they are extracted from the conversation history or the current user query.
+• Correct tools were called with the correct and grounded parameters, whether they are extracted from the conversation history or the current user query.
 • No unnecessary or excessive tool calls were made.
 • No errors occurred in any of the tools.
 • The agent was able to reach the final output that addresses the user's query without facing any issues.
@@ -110,7 +110,8 @@ This level is a 'pass'.
 
 # IMPORTANT NOTES
 - There is a clear distinction between 'pass' levels and 'fail' levels. The distinction is that the tools are called correctly in order to reach the required output. If the agent was not able to reach the final output that addresses the user query, it cannot be either of the 'pass' levels, and vice versa. It is crucial that you ensure you are rating the agent's response with the correct level based on the tool calls made to address the user's query.
-- You are NOT concerned with the correctness of the result of the tool. As long as the tool did not return an error, then the tool output is correct and accurate. Do not look into the correctness of the tool's result.
+- "Correct output" means correct tool with the correct, grounded parameters. You are NOT concerned with the correctness of the result of the tool. As long as the parameters passed were correct and the tool did not return an error, then the tool output is correct and accurate.
+- Ensure that every single parameter that is passed to the tools is correct and grounded from the user query or the conversation history. If the agent passes incorrect parameters or completely makes them up, then this is a fail, even if somehow the agent reaches a correct result.
 
 # Data
 CONVERSATION : {{query}}
@@ -123,7 +124,7 @@ TOOL DEFINITION: {{tool_definition}}
 Your output should consist only of a JSON object, as provided in the examples, that has the following keys:
 - chain_of_thought: a string that explains your thought process to decide on the tool call accuracy level. Start this string with 'Let's think step by step:', and think deeply and precisely about which level should be chosen based on the agent's tool calls and how they were able to address the user's query.
 - tool_calls_success_level: a integer value between 1 and 5 that represents the level of tool call success, based on the level definitions mentioned before. You need to be very precise when deciding on this level. Ensure you are correctly following the rating system based on the description of each level.
-- tool_calls_success_result: 'pass' or 'fail' based on the evaluation level of the tool call accuracy. Levels 1 and 2 are a 'fail', levels 3, 4 and 5 are a 'pass'.
+- tool_calls_sucess_result: 'pass' or 'fail' based on the evaluation level of the tool call accuracy. Levels 1 and 2 are a 'fail', levels 3, 4 and 5 are a 'pass'.
 - additional_details: a dictionary that contains the following keys:
 - tool_calls_made_by_agent: total number of tool calls made by the agent
 - correct_tool_calls_made_by_agent: total number of correct tool calls made by the agent
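
The last hunk renames the result key from `tool_calls_success_result` to `tool_calls_sucess_result`, so any code that reads the judge's JSON by key may be affected. Below is a minimal sketch of a defensive parser, assuming the model returns the JSON object described in the prompt. The function name `parse_tool_call_accuracy`, the constants, and the returned dictionary shape are hypothetical and are not part of azure-ai-evaluation. The sketch accepts either spelling and checks the reported result against the level, using the rule stated in the prompt (levels 1 and 2 fail, levels 3 to 5 pass).

```python
import json

# Hypothetical helper for illustration; not taken from azure-ai-evaluation.
PASS_THRESHOLD = 3  # per the prompt: levels 3, 4 and 5 are a 'pass'

# The commit changes the spelling of the result key, so accept both forms.
RESULT_KEYS = ("tool_calls_success_result", "tool_calls_sucess_result")


def parse_tool_call_accuracy(raw: str) -> dict:
    """Parse the judge's JSON output and check level/result consistency."""
    data = json.loads(raw)

    level = data["tool_calls_success_level"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(level, bool) or not isinstance(level, int) or not 1 <= level <= 5:
        raise ValueError(f"level must be an integer from 1 to 5, got {level!r}")

    expected = "pass" if level >= PASS_THRESHOLD else "fail"

    # Use whichever spelling of the result key is present, if any.
    reported = next((data[key] for key in RESULT_KEYS if key in data), None)
    if reported is not None and reported != expected:
        raise ValueError(
            f"level {level} implies {expected!r}, but the output says {reported!r}"
        )

    return {
        "level": level,
        "result": expected,
        "chain_of_thought": data.get("chain_of_thought", ""),
        "additional_details": data.get("additional_details", {}),
    }
```

Deriving the result from the level, rather than trusting the reported string, keeps the caller's behavior the same regardless of which key spelling the prompt requests.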
