`web/README.md` (3 additions, 3 deletions)
@@ -45,7 +45,7 @@ the `weight` captures the score's importance in the final score. the list of sco
   - ./.venv/bin/flake8 datadog_lambda/
 ```

-any commit between `from` (exclusive) and `to` (inclusive) has a specific prompt (task) that the agent should act on. these prompts might change on a monthly basis by a maintainer so we avoid making the benchmarks too deterministic. the prompts change slightly so this does not make old benchmarks incomparable to the new benchmarks even though that's not the goal of OpenCode-bench because with this benchmark we're trying to compare agents & models to each other.
+any commit between `from` (exclusive) and `to` (inclusive) has a specific prompt (task) that the agent should act on. these prompts may be changed on a monthly basis by a maintainer so that the benchmarks don't become too deterministic. the prompts only change slightly, so old benchmarks stay comparable to new ones, even though that's not the goal of OpenCode-bench: with this benchmark we're trying to compare agents & models to each other, rather than to their older selves.

 ```yaml
 generated_at: 2025-11-04T01:45:24.286Z
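For context on the eval format this hunk touches, here is a rough sketch of how such an entry could be modeled. Aside from `from`, `to`, `weight`, and the flake8 command shown above, every name below is an assumption for illustration, not the project's actual schema.

```ts
// Hypothetical shape of an OpenCode-bench eval entry; field names other
// than `from`, `to`, and `weight` are assumptions, not the real schema.
interface Scorer {
  name: string;     // e.g. "lint"
  command: string;  // e.g. "./.venv/bin/flake8 datadog_lambda/"
  weight: number;   // importance of this score in the final score
}

interface EvalTask {
  from: string;     // commit SHA, exclusive
  to: string;       // commit SHA, inclusive
  prompt: string;   // the task the agent should act on; may be revised monthly
  scorers: Scorer[];
}
```

Modeling the commit range as (`from`, `to`] matches the exclusive/inclusive wording in the paragraph above.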
@@ -64,9 +64,9 @@ since each commit has a benchmark execution with it, we let the user navigate be
 
 by default, the last run is shown on the home page as the main information. but that can change by navigating the commit history, to show the benchmarks of a 1 month old run for instance.
 
-each run shows a per agent and a per model comparison/chart is formed by aggregating the scores of each combination per eval. so the user is able to see more specific information that is specific to a single eval.
+each run shows a per-combination comparison/chart, formed by aggregating the scores of each combination per eval, so the user is able to see information that is specific to a single eval.
 
-we store a per agent:model analysis summary as well that talks about how that agent behaved in a specific run. there's also a difference analysis summary that is per eval, which talks about how different agents/models behvaed in that eval.
+we also store a per-combination (agent:model) analysis summary that describes how that combination behaved in a specific run. there's also a per-eval difference analysis summary, which describes how different agents/models behaved in that eval.
 
 [scatter charts](https://recharts.github.io/en-US/examples/SimpleScatterChart/) are often used to demonstrate the performance of AI models compared to each other.
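As a rough illustration of the per-combination aggregation described in this hunk, the sketch below averages scores per agent:model combination for each eval. The `RawScore` shape and the function name are assumptions for illustration, not the project's actual data model.

```ts
// Minimal aggregation sketch: mean score per agent:model combination, per eval.
// The RawScore shape is an assumption, not the project's real data model.
interface RawScore {
  agent: string;
  model: string;
  evalId: string;
  score: number;
}

// Returns evalId -> "agent:model" -> mean score.
function aggregatePerCombination(scores: RawScore[]): Map<string, Map<string, number>> {
  const grouped = new Map<string, Map<string, number[]>>();
  for (const s of scores) {
    const combo = `${s.agent}:${s.model}`;
    const combos = grouped.get(s.evalId) ?? new Map<string, number[]>();
    combos.set(combo, [...(combos.get(combo) ?? []), s.score]);
    grouped.set(s.evalId, combos);
  }
  const means = new Map<string, Map<string, number>>();
  for (const [evalId, combos] of grouped) {
    const perCombo = new Map<string, number>();
    for (const [combo, values] of combos) {
      perCombo.set(combo, values.reduce((sum, v) => sum + v, 0) / values.length);
    }
    means.set(evalId, perCombo);
  }
  return means;
}
```

The per-combination means could then feed a recharts `ScatterChart` like the one linked above, with one aggregated metric on each axis.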