This is a benchmark for evaluating the ability of AI systems to make fair, data-driven decisions.
The benchmark consists of several tasks.
Roles:
- task-specific: environment files for the task, `train.py`, etc.
- benchmarking infrastructure: code needed to run the overall benchmark, scoring, etc. (`eval-<type>.py`)
- agent: agent tools, agent prompts, etc.
file | description | role |
---|---|---|