ACE (Active learning for Capability Evaluation) is a novel framework that uses active learning and powerful language models to automate fine-grained evaluation of foundation models. It enables scalable, adaptive testing that uncovers strengths and weaknesses beyond static benchmarks.
The development environment is managed with Poetry. Make sure it is installed, then run:
python3 -m poetry install
source $(poetry env info --path)/bin/activate
To install the dependencies for testing (code style, unit tests, integration tests), run:
python3 -m poetry install --with test
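Once the test dependencies are installed, the tests can typically be run with pytest. The exact invocation is not specified here, so the command below is an assumption; adjust it to the repository's actual test setup:
python3 -m pytest  # assumed test runner; not confirmed by this README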
The capability evaluation logs (from evaluations run with Inspect) are stored in a GCP bucket. Log in with your GCP account using the following command:
gcloud auth application-default login
- Set environment variables (an example export block follows this list):
  - OPENAI_API_KEY - To use LLMs provided by OpenAI
  - GOOGLE_API_KEY - To use LLMs provided by Google
  - ANTHROPIC_API_KEY - To use LLMs provided by Anthropic
- Rate limit vars (default values given):
  - RATE_LIMIT_CALLS=5
  - RATE_LIMIT_PERIOD=60
- LangSmith tracing vars:
  - LANGSMITH_TRACING=true
  - LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
  - LANGSMITH_API_KEY=<langsmith_api_key>
  - LANGSMITH_PROJECT="automated_capability_evaluation"
- GCP env vars:
  - GOOGLE_CLOUD_PROJECT=<project_id>
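For example, these variables can be exported in the shell before running the scripts below (values in angle brackets are placeholders to replace with your own credentials and IDs):
# Replace all <...> placeholders before running
export OPENAI_API_KEY=<openai_api_key>
export GOOGLE_API_KEY=<google_api_key>
export ANTHROPIC_API_KEY=<anthropic_api_key>
export RATE_LIMIT_CALLS=5
export RATE_LIMIT_PERIOD=60
export LANGSMITH_TRACING=true
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
export LANGSMITH_API_KEY=<langsmith_api_key>
export LANGSMITH_PROJECT="automated_capability_evaluation"
export GOOGLE_CLOUD_PROJECT=<project_id>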
Modify `src/cfg/run_cfg.yaml`, if required.
In the first step, generates capability names and descriptions. In the second step, for each capability, generates tasks, solves them, and verifies the solutions.
python3 src/run_capability_generation.py
Evaluates the subject LLM on the generated capabilities and calculates a score for each.
python3 src/run_evaluation.py
Utilizes the capabilities and the corresponding subject LLM scores to select or generate a new capability.
python3 src/run_lbo.py
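Putting it all together, an end-to-end run (assuming the environment variables above are set and src/cfg/run_cfg.yaml has been adjusted if needed) chains the three steps in order:
# 1. Generate capabilities and their tasks
python3 src/run_capability_generation.py
# 2. Evaluate the subject LLM on the generated capabilities
python3 src/run_evaluation.py
# 3. Use the scores to select or generate the next capability (active learning step)
python3 src/run_lbo.py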