Right now we only have 3 evals, and only one of them was contributed by the community. That one also has the highest quality, since it was an actual task shipped by an agent and reviewed by a human.
As far as I remember from a discussion with @tmickleydoyle and @thdxr, our goal is to reach 25 evals for the official release. This would not only give us a more accurate benchmark, it would also stabilize the outputs.
The increase would also encourage contributions from the community on what they want agents to achieve, and help us understand how they define a good agent and a good model.
The end goal, as far as I vaguely remember, was to reach 100 evals and beyond, which would be amazing. The more we add, the more difficult the benchmark becomes and the more relevant OpenCode-bench stays. Outdated and irrelevant evals and metrics are the curse of most current benchmarks, and the reason they fail to reflect "real" progress in the industry.