Hi @Ayanami0730. Thanks for this great benchmark, which is very helpful for developing search agents.
As I understand it, the LLM-as-a-judge framework is subject to several kinds of bias, and length bias is one of the more prominent. In our tests, report length correlated positively with the RACE score across the different implementation variants we tried. So I am wondering: does this correlation reflect real report quality, or is part of it spurious? Does DeepResearch Bench have a way to decouple the influence of length bias?
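To make the question concrete, here is a minimal sketch of one way such a check could look. It is not our actual evaluation code and not part of DeepResearch Bench. The function names, the toy numbers, and the idea of using an independent quality signal (for example, human ratings) as a control are all assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, linearly controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (rxy - rxz * ryz) / denom


if __name__ == "__main__":
    # Made-up toy values, NOT real benchmark results.
    word_counts = [1200, 2500, 3100, 4800, 6000]   # report length
    race_scores = [41.0, 43.5, 42.0, 47.0, 49.5]   # judge score (hypothetical)
    human_scores = [3.0, 4.0, 3.0, 5.0, 4.0]       # independent rating (hypothetical)

    print("Spearman(length, score):", spearman(word_counts, race_scores))
    print("Partial corr(score, length | human):",
          partial_corr(race_scores, word_counts, human_scores))
```

If the judge score still tracks length after controlling for an independent quality signal, that would hint at length bias rather than genuine quality. This only captures linear effects and depends on how trustworthy the control signal is, so it is a diagnostic rather than a fix.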
Thanks in advance for your reply!