Commit b4bfd6c

feat: Support default Letta judge agent with new letta_judge grader kind (#86)
1 parent 2a4fd4a commit b4bfd6c

10 files changed: +422 -14 lines changed

examples/letta-agent-rubric-grader/README.md

Lines changed: 83 additions & 11 deletions
@@ -69,36 +69,108 @@ Then run the evaluation as above.

 ### Suite Configuration

+This example provides two configuration approaches:
+
+#### Option 1: Default Letta Judge (Recommended for Most Use Cases)
+
+**File:** `default_judge_suite.yaml`
+
+The simplest configuration uses the built-in default judge agent with pre-fetched webpage content:
+
+```yaml
+name: fetch-webpage-default-judge-test
+dataset: dataset.csv
+target:
+  kind: agent
+  agent_file: test-fetch-webpage-simple-agent.af
+graders:
+  agent_judge:
+    kind: letta_judge # Use letta_judge kind
+    prompt_path: default_judge_rubric.txt # Rubric with hardcoded webpage content
+    extractor: last_assistant
+gate:
+  metric_key: agent_judge
+  op: gte
+  value: 0.7
+```
+
+**How it works:**
+- Uses the default Letta judge agent (no custom agent_file needed)
+- Rubric includes the **hardcoded webpage content** for grading
+- Judge reads the content directly from the rubric prompt
+- Simpler, faster, and more reliable (no live web requests)
+
+**When to use:** Most evaluation scenarios where you know the expected content ahead of time.
+
+#### Option 2: Custom Letta Judge with Live Web Search (Advanced)
+
+**File:** `suite.yaml`
+
+For advanced scenarios where the judge needs to dynamically verify information:
+
 ```yaml
 name: fetch-webpage-agent-judge-test
-description: Test agent responses using a Letta agent as judge with rubric grading
 dataset: dataset.csv
 target:
   kind: agent
   agent_file: test-fetch-webpage-simple-agent.af
-  base_url: http://localhost:8283
 graders:
   agent_judge:
-    kind: rubric
-    agent_file: judge.af # Judge agent with submit_grade tool
-    prompt_path: rubric.txt # Rubric criteria for evaluation
-    judge_tool_name: submit_grade # Tool the judge uses to submit scores
-    extractor: last_assistant # Extract agent's final response
+    kind: letta_judge
+    agent_file: custom_web_search_judge.af # Custom judge with web tools
+    prompt_path: custom_judge_rubric.txt # Rubric instructs live fetching
+    judge_tool_name: submit_grade
+    extractor: last_assistant
 gate:
   metric_key: agent_judge
   op: gte
-  value: 0.75 # Pass if avg score ≥ 0.75
+  value: 0.7
 ```

+**How it works:**
+- Uses a **custom judge agent** with `fetch_webpage` tool capabilities
+- Rubric instructs the judge to **fetch the webpage live** during grading
+- Judge performs real-time web requests to verify agent answers
+- More dynamic but slower and depends on network availability
+
+**When to use:** When evaluating against dynamic content, testing web-fetching capabilities, or when ground truth can't be pre-determined.
+
+**Key Differences:**
+
+| Aspect | Default Judge | Custom Judge |
+|--------|--------------|--------------|
+| **Agent** | Built-in default | Custom with web tools |
+| **Rubric** | Hardcoded content | Instructions to fetch live |
+| **Speed** | Faster (no web requests) | Slower (live fetching) |
+| **Reliability** | Higher (offline) | Lower (network dependent) |
+| **Use Case** | Static evaluation | Dynamic verification |
+| **Config Complexity** | Minimal (2 required fields) | Higher (4+ fields) |
+
 **Key Configuration Options:**
-- `agent_file`: Path to `.af` file containing the judge agent
+- `kind`: Must be `letta_judge` for agent-based judges
+- `agent_file`: (Optional) Path to custom `.af` judge agent. If omitted, uses default judge
 - `prompt_path`: Path to file containing rubric text (can also use `prompt` for inline rubric)
-- `judge_tool_name`: Name of the tool the judge calls to submit scores (default: `submit_grade`)
+- `judge_tool_name`: (Optional) Name of the tool the judge calls to submit scores. Only allowed with custom `agent_file`
 - `extractor`: How to extract the submission from agent trajectory

 ### Judge Agent Requirements & Gotchas

-#### ✅ Checklist: Will Your Judge Agent Work?
+**Recommendation: Use the Default Judge**
+
+We **highly recommend** using the default Letta judge (Option 1) for most use cases. Configuring a custom judge agent is complex and error-prone, with several potential footguns:
+- Tool schema must exactly match expected parameters
+- Tool name must be correctly specified
+- Agent must not have conflicting tools that confuse evaluation
+- Additional complexity in debugging when things go wrong
+
+**Only create a custom judge if you have a specific need**, such as:
+- Judge needs to fetch live web content for verification
+- Judge requires access to custom tools (databases, APIs, etc.)
+- Special evaluation logic that can't be expressed in the rubric alone
+
+If you do need a custom judge, use this checklist:
+
+#### Checklist: Will Your Judge Agent Work?

 Use this checklist to verify your judge agent is properly configured:
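The "Key Configuration Options" bullets in the README change above imply a few constraints on a `letta_judge` grader entry. The following is a minimal Python sketch of those checks, not the library's own validation: `validate_letta_judge` is a hypothetical helper, and treating a missing rubric (`prompt` or `prompt_path`) as an error is an assumption drawn from the bullets.

```python
from typing import Any


def validate_letta_judge(grader: dict[str, Any]) -> list[str]:
    """Return the problems found in one grader entry (empty list if none).

    Hypothetical helper: the rules mirror the README bullets, not the
    real validation code of the evaluation library.
    """
    problems: list[str] = []
    if grader.get("kind") != "letta_judge":
        problems.append("kind must be 'letta_judge' for agent-based judges")
    # Assumption: a rubric is required, either inline or via a file path.
    if not (grader.get("prompt") or grader.get("prompt_path")):
        problems.append("provide a rubric via 'prompt' or 'prompt_path'")
    # README: judge_tool_name is only allowed with a custom agent_file.
    if "judge_tool_name" in grader and "agent_file" not in grader:
        problems.append("judge_tool_name requires a custom agent_file")
    return problems


# Option 1: default judge, no agent_file and no judge_tool_name.
default_judge = {
    "kind": "letta_judge",
    "prompt_path": "default_judge_rubric.txt",
    "extractor": "last_assistant",
}
# Option 2: custom judge, agent_file plus judge_tool_name.
custom_judge = {
    "kind": "letta_judge",
    "agent_file": "custom_web_search_judge.af",
    "prompt_path": "custom_judge_rubric.txt",
    "judge_tool_name": "submit_grade",
    "extractor": "last_assistant",
}
# Invalid mix: judge_tool_name without a custom agent_file.
invalid_judge = {
    "kind": "letta_judge",
    "prompt_path": "default_judge_rubric.txt",
    "judge_tool_name": "submit_grade",
}

for entry in (default_judge, custom_judge, invalid_judge):
    print(validate_letta_judge(entry))
```

Both configurations from the README pass these checks; the third entry is rejected because it names a judge tool while relying on the default judge.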

examples/letta-agent-rubric-grader/suite.yaml renamed to examples/letta-agent-rubric-grader/custom_judge_suite.yaml

Lines changed: 3 additions & 3 deletions
@@ -7,9 +7,9 @@ target:
   base_url: http://localhost:8283
 graders:
   agent_judge:
-    kind: rubric
-    agent_file: judge.af
-    prompt_path: rubric.txt
+    kind: letta_judge
+    agent_file: custom_web_search_judge.af
+    prompt_path: custom_judge_rubric.txt
     judge_tool_name: submit_grade
     extractor: last_assistant
 gate:
Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
+Evaluate the agent's response based on the following criteria:
+
+1. **Correctness (0.6 weight)**: Does the response contain accurate information from the webpage? Check if the answer matches what was requested in the input.
+
+2. **Format (0.2 weight)**: Is the response formatted correctly? The input often requests answers in a specific format (e.g., in brackets like {Answer}).
+
+3. **Completeness (0.2 weight)**: Does the response fully address the question without unnecessary information?
+
+Scoring Guidelines:
+- 1.0: Perfect response - correct, properly formatted, and complete
+- 0.75-0.99: Good response - minor formatting or completeness issues
+- 0.5-0.74: Adequate response - correct information but format/completeness problems
+- 0.25-0.49: Poor response - partially correct or missing key information
+- 0.0-0.24: Failed response - incorrect or no relevant information
+
+Below is the content of the webpage that the agent fetched in order to answer the question. Please review this content to grade correctness:
+
+```
+Title: webpage1
+
+URL Source: https://www.york.ac.uk/teaching/cws/wws/webpage1.html
+
+Markdown Content:
+STARTING . . .
+--------------
+
+There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language - HTML.
+
+HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!
+
+Learning HTML will enable you to:
+
+* create your own simple pages
+* read and appreciate pages created by others
+* develop an understanding of the creative and literary implications of web-texts
+* have the confidence to branch out into more complex web design
+
+A HTML web page is made up of tags. Tags are placed in brackets like this **< tag >**. A tag tells the browser how to display information. Most tags need to be opened < tag > and closed < /tag >.
+
+To make a simple web page you need to know only four tags:
+
+* < HTML > tells the browser your page is written in HTML format
+* < HEAD > this is a kind of preface of vital information that doesn't appear on the screen.
+* < TITLE >Write the title of the web page here - this is the information that viewers see on the upper bar of their screen. (I've given this page the title 'webpage1').
+* < BODY >This is where you put the content of your page, the words and pictures that people read on the screen.
+
+All these tags need to be closed.
+
+#### EXERCISE
+
+Write a simple web page.
+
+Copy out exactly the HTML below, using a WP program such as Notepad.
+
+Information in _italics_ indicates where you can insert your own text, other information is HTML and needs to be exact. However, make sure there are no spaces between the tag brackets and the text inside.
+
+(Find Notepad by going to the START menu\ PROGRAMS\ ACCESSORIES\ NOTEPAD).
+
+< HTML >
+
+< HEAD >
+
+< TITLE >_title of page_< /TITLE >
+
+< /HEAD >
+
+< BODY>
+
+_write what you like here: 'my first web page', or a piece about what you are reading, or a few thoughts on the course, or copy out a few words from a book or cornflake packet. Just type in your words using no extras such as bold, or italics, as these have special HTML tags, although you may use upper and lower case letters and single spaces._
+
+< /BODY >
+
+< /HTML >
+
+Save the file as 'first.html' (ie. call the file anything at all) It's useful if you start a folder - just as you would for word-processing - and call it something like WEBPAGES, and put your first.html file in the folder.
+
+NOW - open your browser.
+
+On Netscape the process is:
+
+Top menu; FILE\ OPEN PAGE\ CHOOSE FILE
+
+Click on your WEBPAGES folder\ FIRST file
+
+Click 'open' and your page should appear.
+
+On Internet Explorer:
+
+Top menu; FILE\ OPEN\ BROWSE
+
+Click on your WEBPAGES folder\ FIRST file
+
+Click 'open' and your page should appear.
+
+If the page doesn't open, go back over your notepad typing and make sure that all the HTML tags are correct. Check there are no spaces between tags and internal text; check that all tags are closed; check that you haven't written < HTLM > or < BDDY >. Your page will work eventually.
+
+Make another page. Call it somethingdifferent.html and place it in the same WEBPAGES folder as detailed above.
+
+start formatting in [lesson two](https://www.york.ac.uk/teaching/cws/wws/webpage2.html)
+
+[back to wws index](https://www.york.ac.uk/teaching/cws/wws/col3.html)
+```
+
+Use the submit_grade tool to submit your evaluation with a score between 0.0 and 1.0.
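The rubric above weights correctness, format, and completeness at 0.6, 0.2, and 0.2. It does not state how the judge combines them, so the sketch below assumes a plain weighted sum of per-criterion scores in the range 0.0 to 1.0; an LLM judge may well reason differently. `weighted_score` and `band` are hypothetical names used only to illustrate the arithmetic and the scoring bands.

```python
# Weights taken from the rubric's three criteria.
WEIGHTS = {"correctness": 0.6, "format": 0.2, "completeness": 0.2}

# Lower bound of each band in the rubric's Scoring Guidelines, highest first.
BANDS = [
    (1.0, "Perfect"),
    (0.75, "Good"),
    (0.5, "Adequate"),
    (0.25, "Poor"),
    (0.0, "Failed"),
]


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (each in [0, 1]) as a weighted sum.

    Assumption: the rubric weights are applied linearly.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected scores for exactly {sorted(WEIGHTS)}")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Round to avoid floating-point noise such as 0.7999999999999999.
    return round(total, 4)


def band(score: float) -> str:
    """Map a final score to the label of the rubric band it falls in."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise AssertionError("unreachable: 0.0 is the lowest bound")


# A correct, complete answer that ignores the requested {Answer} format.
example = weighted_score({"correctness": 1.0, "format": 0.0, "completeness": 1.0})
print(example, band(example))
```

Under this assumption, a correct and complete answer in the wrong format scores 0.8, which lands in the "Good" band and still clears a 0.7 gate.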
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+name: fetch-webpage-default-judge-test
+description: Test agent responses using the default Letta judge with rubric grading
+dataset: dataset.csv
+target:
+  kind: agent
+  agent_file: test-fetch-webpage-simple-agent.af
+  base_url: http://localhost:8283
+graders:
+  agent_judge:
+    kind: letta_judge
+    prompt_path: default_judge_rubric.txt
+    extractor: last_assistant
+gate:
+  metric_key: agent_judge
+  op: gte
+  value: 0.7
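The `gate` block above passes the suite when the `agent_judge` metric is greater than or equal to 0.7. A minimal sketch of that comparison follows, assuming the metric is the average of the per-sample judge scores (the comment removed from the README, "Pass if avg score ≥ 0.75", suggests an average). `gate_passes` and the operator table are hypothetical; only `gte` appears in this commit, and the other operator names are assumptions.

```python
import operator
from statistics import mean

# Only "gte" is confirmed by the suite files; the rest are assumed names.
OPS = {
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
}


def gate_passes(scores: list[float], gate: dict) -> bool:
    """Compare the average score against the gate threshold.

    Assumption: the gated metric is the mean of per-sample judge scores.
    """
    if not scores:
        raise ValueError("no scores to evaluate")
    compare = OPS[gate["op"]]
    return compare(mean(scores), gate["value"])


gate = {"metric_key": "agent_judge", "op": "gte", "value": 0.7}

print(gate_passes([1.0, 0.8, 0.5], gate))   # average is about 0.77
print(gate_passes([0.5, 0.6, 0.75], gate))  # average is about 0.62
```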
