
ACE

ACE (Active learning for Capability Evaluation) is a novel framework that uses active learning and powerful language models to automate fine-grained evaluation of foundation models. It enables scalable, adaptive testing that uncovers strengths and weaknesses beyond static benchmarks.

Installing dependencies

The development environment can be set up using poetry. Make sure it is installed, then run:

python3 -m poetry install
source $(poetry env info --path)/bin/activate

To install the dependencies for testing (codestyle, unit tests, integration tests), run:

python3 -m poetry install --with test
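
Once the test dependencies are installed, the unit tests can typically be run from within the poetry environment (assuming the project follows the standard pytest layout):

python3 -m poetry run pytest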

[Optional] Google Cloud Authentication

The capability evaluation logs (produced by Inspect) are stored in a GCP bucket. Use the following command to log in with your GCP account:

gcloud auth application-default login

Run pipeline

Configuration

  1. Set environment variables (see the example below):
  • OPENAI_API_KEY - To use LLMs provided by OpenAI
  • GOOGLE_API_KEY - To use LLMs provided by Google
  • ANTHROPIC_API_KEY - To use LLMs provided by Anthropic
  • Rate limit vars (default values given):
    • RATE_LIMIT_CALLS=5
    • RATE_LIMIT_PERIOD=60
  • LangSmith tracing vars:
    • LANGSMITH_TRACING=true
    • LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
    • LANGSMITH_API_KEY=<langsmith_api_key>
    • LANGSMITH_PROJECT="automated_capability_evaluation"
  • GCP env vars:
    • GOOGLE_CLOUD_PROJECT=<project_id>
  2. Modify src/cfg/run_cfg.yaml, if required.
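
For reference, setting the variables in a shell might look like the following (all values in angle brackets are placeholders; use your own keys and project ID):

export OPENAI_API_KEY=<openai_api_key>
export GOOGLE_API_KEY=<google_api_key>
export ANTHROPIC_API_KEY=<anthropic_api_key>
export RATE_LIMIT_CALLS=5
export RATE_LIMIT_PERIOD=60
export LANGSMITH_TRACING=true
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
export LANGSMITH_API_KEY=<langsmith_api_key>
export LANGSMITH_PROJECT="automated_capability_evaluation"
export GOOGLE_CLOUD_PROJECT=<project_id>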

Capability Generation using the scientist LLM

In the first step, the scientist LLM generates capability names and descriptions. In the second step, for each capability, it generates tasks, solves them, and verifies the solutions.

python3 src/run_capability_generation.py
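
Conceptually, the two steps follow the loop sketched below. This is not the repository's implementation; the model name, prompts, and helper names are illustrative, and it assumes the openai Python client with OPENAI_API_KEY set.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    # Single scientist-LLM call; the model name is a placeholder.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1: generate capability names and descriptions.
capabilities = ask("List 3 math capabilities, one per line, as 'name: description'.").splitlines()

# Step 2: for each capability, generate tasks, solve them, and verify the solutions.
for capability in capabilities:
    tasks = ask(f"Write 2 short tasks that test the capability: {capability}").splitlines()
    for t in tasks:
        solution = ask(f"Solve this task step by step: {t}")
        verdict = ask(f"Task: {t}\nProposed solution: {solution}\nIs the solution correct? Answer yes or no.")
        print(capability, t, verdict, sep=" | ")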

Evaluation of subject LLM on generated capabilities

Evaluates the subject LLM on the generated capabilities and calculates a score for each.

python3 src/run_evaluation.py
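
Each capability's tasks are evaluated with Inspect (see the GCP note above for where the logs land). As a point of reference, a self-contained Inspect evaluation looks roughly like the following; this is a generic toy task, not one of the generated capabilities, and exact imports or parameters may vary with the inspect_ai version.

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def toy_arithmetic():
    # One hand-written sample; a generated capability contains many such tasks.
    return Task(
        dataset=[Sample(input="What is 17 + 25? Reply with only the number.", target="42")],
        solver=generate(),
        scorer=match(),
    )

# Run the subject LLM on the task; the model name is a placeholder.
eval(toy_arithmetic(), model="openai/gpt-4o-mini")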

Capability selection/generation using active learning

Uses the generated capabilities and the corresponding subject LLM scores to select or generate a new capability.

python3 src/run_lbo.py
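
In spirit, this step is an active-learning loop: fit a surrogate model on (capability representation, score) pairs and pick the candidate the model is least certain about. The sketch below illustrates that idea with a Gaussian process over toy 2-D capability embeddings; it is a simplified illustration, not the repository's LBO implementation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Toy data: 2-D "embeddings" of already-evaluated capabilities and their scores.
evaluated = rng.uniform(size=(10, 2))
scores = rng.uniform(size=10)

# Candidate capabilities that have not been evaluated yet.
candidates = rng.uniform(size=(50, 2))

# Fit a surrogate model mapping capability embeddings to subject-LLM scores.
surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), normalize_y=True)
surrogate.fit(evaluated, scores)

# Pick the candidate with the largest predictive uncertainty to evaluate next.
mean, std = surrogate.predict(candidates, return_std=True)
next_capability = candidates[np.argmax(std)]
print("Next capability embedding to evaluate:", next_capability)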
