Description
I was following the MedHELM docs and the doc on reproducing the leaderboard to recreate the MedHELM leaderboard on the ACI-Bench dataset, but with local HF models such as Llama-3.1. Inference with the local models ran smoothly, but I am having a tough time getting the LLM-Jury annotators to run. helm-run keeps emitting warnings like "Could not find key 'stanfordhealthcare/llama-3.3-70b-instruct' under key 'deployments' in credentials.conf". I did write that key into credentials.conf, yet it still reports it as missing. (I wrote the anthropic and openai models in the same fashion and they ran fine, so this is perplexing.)
The more fundamental problem is that I don't see any way to select the LLM-Jury models or model deployments myself. This is frustrating because all three annotator LLMs default to the fixed OpenAI, Llama, and Anthropic models from the MedHELM paper, and on top of that they always use stanfordhealthcare/ deployments, which I don't have access to. I found that the models and model deployments for the ACI-Bench LLM jury are hard-coded in src/helm/benchmark/annotation/aci_bench_annotator.py as:
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="stanfordhealthcare/gpt-4o-2024-05-13",
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.3-70b-instruct",
        model_deployment="stanfordhealthcare/llama-3.3-70b-instruct",
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219",
    ),
}
Is there any way I can bypass or re-declare those models and model deployments as something else?
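For reference, the only bypass I have found so far is editing that dict in my local checkout, along these lines (a minimal sketch; the openai/ and anthropic/ deployment names, and the huggingface/llama-3.2-1b-instruct deployment from my model_deployments.yaml below, are just stand-ins for deployments I can actually reach):

# Hypothetical local patch to aci_bench_annotator.py: same structure,
# but each juror points at a deployment I have credentials or hardware for.
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="openai/gpt-4o-2024-05-13",  # plain OpenAI instead of stanfordhealthcare/
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.2-1b-instruct",
        model_deployment="huggingface/llama-3.2-1b-instruct",  # local HF deployment defined below
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="anthropic/claude-3-7-sonnet-20250219",
    ),
}

Maintaining a patched copy of the annotator just to swap the jury seems wrong, though, so I would much rather have a supported way to override this.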
Here is the configuration I have been using to run the framework so far:
prod_env/credentials.conf
openaiApiKey: "..."
anthropicApiKey: "..."
"deployments": {
"stanfordhealthcare/gpt-4o-2024-05-13": "...",
"stanfordhealthcare/claude-3-7-sonnet-20250219": "...",
"stanfordhealthcare/llama-3.3-70b-instruct": "..."
}
prod_env/model_deployments.yaml
model_deployments:
  - name: huggingface/llama-3.2-1b-instruct
    model_name: meta/llama-3.2-1b-instruct
    tokenizer_name: meta/llama-3.2-1b-instruct
    max_sequence_length: 131072
    client_spec:
      class_name: "helm.clients.huggingface_client.HuggingFaceClient"
      args:
        pretrained_model_name_or_path: meta-llama/Llama-3.2-1B-Instruct
        torch_dtype: "float16"
  - name: stanfordhealthcare/gpt-4o-2024-05-13
    model_name: openai/gpt-4.1-2025-04-14
    tokenizer_name: openai/o200k_base
    max_sequence_length: 1047576
    host_organization: openai
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"
  - name: stanfordhealthcare/claude-3-7-sonnet-20250219
    model_name: anthropic/claude-3-7-sonnet-20250219
    tokenizer_name: anthropic/claude
    max_sequence_length: 200000
    host_organization: anthropic
    client_spec:
      class_name: "helm.clients.anthropic_client.AnthropicMessagesClient"
  - name: stanfordhealthcare/llama-3.3-70b-instruct
    model_name: openai/gpt-4.1-2025-04-14
    tokenizer_name: openai/o200k_base
    max_sequence_length: 1047576
    host_organization: openai
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"
./reproduce_leaderboard.sh
# Pick any suite name of your choice
export SUITE_NAME=my-medhelm-suite
# Replace this with your model or models
export MODELS_TO_RUN=meta/llama-3.2-1b-instruct
# Get these from the list below
export RUN_ENTRIES_CONF_PATH=run_entries_medhelm_public.conf
export SCHEMA_PATH=schema_medhelm.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10
export PRIORITY=2
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN
helm-summarize --schema $SCHEMA_PATH --suite $SUITE_NAME
helm-server --suite $SUITE_NAME
./run_entries_medhelm_public.conf
entries: [
    {description: "aci_bench:model=meta/llama-3.1-8b-instruct,model_deployment=huggingface/llama-3.1-8b-instruct", priority: 1},
    {description: "aci_bench:model=meta/llama-3.2-3b-instruct,model_deployment=huggingface/llama-3.2-3b-instruct", priority: 1},
    {description: "aci_bench:model=meta/llama-3.2-1b-instruct,model_deployment=huggingface/llama-3.2-1b-instruct", priority: 1},
]