
ACE

ACE (Active learning for Capability Evaluation) is a novel framework that uses active learning and powerful language models to automate fine-grained evaluation of foundation models. It enables scalable, adaptive testing that uncovers strengths and weaknesses beyond static benchmarks.

Installing dependencies

The development environment can be set up using poetry. Make sure it is installed, then run:

python3 -m poetry install
source $(poetry env info --path)/bin/activate

To install the dependencies for testing (codestyle, unit tests, integration tests), run:

python3 -m poetry install --with test
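
Once the test dependencies are installed, the unit tests can typically be run from within the poetry environment (assuming the project follows the standard pytest layout):

python3 -m poetry run pytest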

[Optional] Google Cloud Authentication

The capability evaluation logs (produced by Inspect) are stored in a GCP bucket. Use the following command to log in with your GCP account:

gcloud auth application-default login

Run pipeline

Configuration

  1. Set environment variables (see the example below):
  • OPENAI_API_KEY - To use LLMs provided by OpenAI
  • GOOGLE_API_KEY - To use LLMs provided by Google
  • ANTHROPIC_API_KEY - To use LLMs provided by Anthropic
  • Rate limit vars (default values given):
    • RATE_LIMIT_CALLS=5
    • RATE_LIMIT_PERIOD=60
  • LangSmith tracing vars:
    • LANGSMITH_TRACING=true
    • LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
    • LANGSMITH_API_KEY=<langsmith_api_key>
    • LANGSMITH_PROJECT="automated_capability_evaluation"
  • GCP env vars:
    • GOOGLE_CLOUD_PROJECT=<project_id>
  2. Modify src/cfg/run_cfg.yaml, if required.
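
For reference, setting the variables in a shell might look like the following (all values in angle brackets are placeholders; use your own keys and project ID):

export OPENAI_API_KEY=<openai_api_key>
export GOOGLE_API_KEY=<google_api_key>
export ANTHROPIC_API_KEY=<anthropic_api_key>
export RATE_LIMIT_CALLS=5
export RATE_LIMIT_PERIOD=60
export LANGSMITH_TRACING=true
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
export LANGSMITH_API_KEY=<langsmith_api_key>
export LANGSMITH_PROJECT="automated_capability_evaluation"
export GOOGLE_CLOUD_PROJECT=<project_id>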

Capability Generation using the scientist LLM

In the first step, the scientist LLM generates capability names and descriptions. In the second step, for each capability, it generates tasks, solves them, and verifies the solutions.

python3 src/run_capability_generation.py
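
Conceptually, the two steps follow the loop sketched below. This is not the repository's implementation; the model name, prompts, and helper names are illustrative, and it assumes the openai Python client with OPENAI_API_KEY set.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    # Single scientist-LLM call; the model name is a placeholder.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: generate capability names and descriptions.
capabilities = ask("List 3 math capabilities, one per line, as 'name: description'.").splitlines()

# Step 2: for each capability, generate tasks, solve them, and verify the solutions.
for capability in capabilities:
    tasks = ask(f"Write 2 short tasks that test the capability: {capability}").splitlines()
    for t in tasks:
        solution = ask(f"Solve this task step by step: {t}")
        verdict = ask(f"Task: {t}\nProposed solution: {solution}\nIs the solution correct? Answer yes or no.")
        print(capability, t, verdict, sep=" | ")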

Evaluation of subject LLM on generated capabilities

Evaluates the subject LLM on the generated capabilities and calculates a score for each.

python3 src/run_evaluation.py
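
Each capability's tasks are evaluated with Inspect (see the GCP note above for where the logs land). As a point of reference, a self-contained Inspect evaluation looks roughly like the following; this is a generic toy task, not one of the generated capabilities, and exact imports or parameters may vary with the inspect_ai version.

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_arithmetic():
    # One hand-written sample; a generated capability contains many such tasks.
    return Task(
        dataset=[Sample(input="What is 17 + 25? Reply with only the number.", target="42")],
        solver=generate(),
        scorer=match(),
    )

# Run the subject LLM on the task; the model name is a placeholder.
eval(toy_arithmetic(), model="openai/gpt-4o-mini")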

Capability selection/generation using active learning

Uses the generated capabilities and the corresponding subject LLM scores to select or generate a new capability.

python3 src/run_lbo.py
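
In spirit, this step is an active-learning loop: fit a surrogate model on (capability representation, score) pairs and pick the candidate the model is least certain about. The sketch below illustrates that idea with a Gaussian process over toy 2-D capability embeddings; it is a simplified illustration, not the repository's LBO implementation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Toy data: 2-D "embeddings" of already-evaluated capabilities and their scores.
evaluated = rng.uniform(size=(10, 2))
scores = rng.uniform(size=10)

# Candidate capabilities that have not been evaluated yet.
candidates = rng.uniform(size=(50, 2))

# Fit a surrogate model mapping capability embeddings to subject-LLM scores.
surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
surrogate.fit(evaluated, scores)

# Pick the candidate with the largest predictive uncertainty to evaluate next.
mean, std = surrogate.predict(candidates, return_std=True)
next_capability = candidates[np.argmax(std)]
print("Next capability embedding to evaluate:", next_capability)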
