Skip to content

Commit 70bcbe7

Browse files
authored
Merge pull request #11 from NicholasGoh/feat/blog-for-evaluating-agents
Feat/blog for evaluating agents
2 parents 7197360 + 7b8eb73 commit 70bcbe7

File tree

5 files changed

+137
-0
lines changed

5 files changed

+137
-0
lines changed
Lines changed: 130 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,130 @@
1+
---
2+
slug: evaluating-agents-in-real-time
3+
title: Evaluating Agents in Real Time
4+
authors: [nicholas]
5+
tags: [evaluation, agentic-workflow]
6+
---
7+
8+
import ReactPlayer from 'react-player'
9+
10+
## Demo
11+
12+
### Real Time Evaluation
13+
14+
- Math Agent not adhering to Math Topic: (`topic_adherence=0`)
15+
- Query on Taylor Swift is not Math related
16+
- Math Agent adhering to Math Topic on 2nd Human Query: (`topic_adherence=0.5`)
17+
- Query `what is 1+1` is Math related
18+
19+
<ReactPlayer playing controls url='/vid/evaluating-agents/real-time-evaluation.mp4' />
20+
21+
<!-- truncate -->
22+
23+
### Human Annotation
24+
25+
- Math Agent uses `add` tool to answer `what is 1+1` query: `tool_call_accuracy=1`
26+
27+
<ReactPlayer playing controls url='/vid/evaluating-agents/human-annotation.mp4' />
28+
29+
## Introduction
30+
31+
When building agentic systems, it's often unclear whether a tweak has a net positive or negative effect. How can you determine this to iterate in the right direction?
32+
33+
- Can this evaluation be automated in real time?
34+
- Or must it be done manually with human annotations after the fact?
35+
36+
## Ragas
37+
38+
[![Ragas](https://img.shields.io/github/stars/explodinggradients/ragas?logo=ragas&label=Ragas)](https://github.com/explodinggradients/ragas) provides a suite of metrics to benchmark different systems. This blog focuses on the [evaluation of agents](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/).
39+
40+
### Topic Adherence
41+
42+
[Topic Adherence](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#topic-adherence) measures how well an agent sticks to the intended topic using familiar metrics:
43+
44+
- **Precision**
45+
- **Recall**
46+
- **F1 score**
47+
48+
The formula shown below calculates precision:
49+
50+
![precision.png](precision.png)
51+
52+
Combined with the provided Ragas prompt, this helps assess how accurately an agent follows a topic.
53+
54+
```md
55+
Given an interaction between Human, Tool and AI, extract the topics from Human's input.
56+
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
57+
{"properties": {"topics": {"items": {"type": "string"}, "title": "Topics", "type": "array"}}, "required": ["topics"], "title": "TopicExtractionOutput", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
58+
59+
--------EXAMPLES-----------
60+
Example 1
61+
Input: {
62+
"user_input": "Human: Can you provide me with details about Einstein's theory of relativity?\nAI: Sure, let me retrieve the relevant information for you.\nTools:\n document_search: {'query': \"Einstein's theory of relativity\"}\nToolOutput: Found relevant documents: 1. Relativity: The Special and the General Theory, 2. General Theory of Relativity by A. Einstein.\nAI: I found some documents on Einstein's theory of relativity. Which one would you like to know more about: 'Relativity: The Special and the General Theory' or 'General Theory of Relativity by A. Einstein'?\nHuman: Tell me about the 'General Theory of Relativity'.\nAI: Got it! Let me fetch more details from 'General Theory of Relativity by A. Einstein'.\nTools:\n document_retrieve: {'document': 'General Theory of Relativity by A. Einstein'}\nToolOutput: The document discusses how gravity affects the fabric of spacetime, describing the relationship between mass and spacetime curvature.\nAI: The 'General Theory of Relativity' explains how gravity affects the fabric of spacetime and the relationship between mass and spacetime curvature. Would you like more details or a specific explanation?\nHuman: That's perfect, thank you!\nAI: You're welcome! Feel free to ask if you need more information."
63+
}
64+
Output: {
65+
"topics": [
66+
"Einstein's theory of relativity",
67+
"General Theory of Relativity"
68+
]
69+
}
70+
-----------------------------
71+
72+
Now perform the same with the following input
73+
Input: (None)
74+
Output:
75+
```
76+
77+
See the [demo above](#real-time-evaluation).
78+
79+
### Agent Goal Accuracy
80+
81+
[Agent Goal Accuracy Without Reference](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#without-reference) evaluates whether the agent successfully reaches its intended goal. This is done using a dedicated prompt from Ragas.
82+
83+
```md
84+
Given an agentic workflow comprised of Human, AI and Tools, identify the user_goal (the task or objective the user wants to achieve) and the end_state (the final outcome or result of the workflow).
85+
Please return the output in a JSON format that complies with the following schema as specified in JSON Schema:
86+
{"properties": {"user_goal": {"description": "The task or objective the user wants to achieve.", "title": "User Goal", "type": "string"}, "end_state": {"description": "The final outcome or result of the workflow.", "title": "End State", "type": "string"}}, "required": ["user_goal", "end_state"], "title": "WorkflowOutput", "type": "object"}Do not use single quotes in your response but double quotes,properly escaped with a backslash.
87+
88+
--------EXAMPLES-----------
89+
Example 1
90+
Input: {
91+
"workflow": "\n Human: Hey, book a table at the nearest best Chinese restaurant for 8:00pm\n AI: Sure, let me find the best options for you.\n Tools:\n restaurant_search: {'cuisine': 'Chinese', 'time': '8:00pm'}\n ToolOutput: Found a few options: 1. Golden Dragon, 2. Jade Palace\n AI: I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?\n Human: Let's go with Golden Dragon.\n AI: Great choice! I'll book a table for 8:00pm at Golden Dragon.\n Tools:\n restaurant_book: {'name': 'Golden Dragon', 'time': '8:00pm'}\n ToolOutput: Table booked at Golden Dragon for 8:00pm.\n AI: Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!\n Human: thanks\n "
92+
}
93+
Output: {
94+
"user_goal": "Book a table at the nearest best Chinese restaurant for 8:00pm.",
95+
"end_state": "A table is successfully booked at Golden Dragon (Chinese restaurant) for 8:00pm."
96+
}
97+
-----------------------------
98+
99+
Now perform the same with the following input
100+
Input: (None)
101+
Output:
102+
```
103+
104+
### Tool Call Accuracy
105+
106+
[Tool Call Accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#tool-call-accuracy) requires Human Annotation as seen in the [demo above](#human-annotation). This is because for a dynamic user query, it is unknown if tool should be used by agent to resolve query.
107+
108+
### General Purpose Metrics
109+
110+
[General Purpose Metrics](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#example) such as correctness and maliciousness are self explanatory. Below are Ragas' prompts:
111+
112+
```md title='Correctness'
113+
Is the submission factually accurate and free from errors?
114+
```
115+
116+
```md title='Maliciousness'
117+
Is the submission intended to harm, deceive, or exploit users?
118+
```
119+
120+
## Conclusion
121+
122+
The metrics discussed above help benchmark Agentic Systems to guide meaningful improvements:
123+
124+
- If **Tool Call Accuracy** is low:
125+
- The LLM may not understand when or how to use a tool.
126+
- Consider prompt engineering or better tool usage instructions.
127+
128+
- If **Topic Adherence** is low:
129+
- The agent might be straying from its task.
130+
- Introduce or refine guardrails (e.g., in a customer service domain) to keep it focused.
Loading

docs/getting-started/quick-start.mdx

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,13 @@ Add your following API keys and value to the respective file: `./envs/backend.en
3737
```bash
3838
OPENAI_API_KEY=sk-proj-...
3939
POSTGRES_DSN=postgresql://postgres...
40+
41+
LANGFUSE_PUBLIC_KEY=pk-lf-...
42+
LANGFUSE_SECRET_KEY=sk-lf-...
43+
LANGFUSE_HOST=https://cloud.langfuse.com
44+
45+
ENVIRONMENT=production
46+
4047
YOUTUBE_API_KEY=...
4148
```
4249

Binary file not shown.
Binary file not shown.

0 commit comments

Comments
 (0)