more evals #12

@Aslemammad

Description

right now, we only have 3 evals, and only one of them was contributed by the community. that one is also the highest-quality eval, since it was an actual task shipped by an agent and reviewed by a human.

as far as i remember from a discussion with @tmickleydoyle and @thdxr, our goal is to reach 25 evals for the official release. this would not only help us achieve a more accurate benchmark, it would also stabilize the outputs.

the increase would also encourage contributions from the community on what they want agents to achieve, and help us understand how they define a good agent and a good model.

the end goal, as far as i vaguely remember, was to reach 100 evals and beyond, which would be amazing. the more we add, the more difficult the benchmark becomes and the more relevant OpenCode-bench stays. outdated and irrelevant evals and metrics are the curse of most current benchmarks, making them fail to reflect the "real" progress in the industry.
