
Fixed Annotator Model Deployment when reproducing MedHELM #3798

@wjkim9653

Description


I was following the MedHELM docs and the doc on reproducing the leaderboard to recreate the MedHELM leaderboard on the ACI-Bench dataset, but with local HF models such as Llama-3.1. Inference with the local models ran smoothly, but I am having a tough time getting the LLM-Jury annotators to run. They keep emitting warnings like "Could not find key 'stanfordhealthcare/llama-3.3-70b-instruct' under key 'deployments' in credentials.conf". I did write that key into my credentials.conf file, yet it still claims it cannot find it. (I wrote the Anthropic and OpenAI models in the same fashion and they ran fine, which is what makes this perplexing.)
The more fundamental problem is that I don't see a way to select the LLM-Jury models or model deployments myself. This is frustrating because all three annotator LLMs default to the fixed OpenAI, Llama, and Anthropic models from the MedHELM paper, and on top of that they always default to the stanfordhealthcare/ deployments, which I don't have access to. I found that the models and model deployments for the ACI-Bench LLM Jury are hard-coded in 'src/helm/benchmark/annotation/aci_bench_annotator.py' as:

ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="stanfordhealthcare/gpt-4o-2024-05-13",
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.3-70b-instruct",
        model_deployment="stanfordhealthcare/llama-3.3-70b-instruct",
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219",
    ),
}

Is there any way I can bypass or re-declare those models and model deployments to something else?
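For reference, this is the shape of the override I am hoping for: the same three judge models, but pointed at deployments I can actually serve. The `AnnotatorModelInfo` stand-in below is a minimal sketch (a frozen dataclass with the same two fields used above, so the snippet is self-contained), and the `openai/`, `huggingface/`, and `anthropic/` deployment names are placeholders for entries in my own model_deployments.yaml, not names I have verified to exist in HELM:

```python
from dataclasses import dataclass
from typing import Dict


# Stand-in for HELM's AnnotatorModelInfo (same two fields as in the
# snippet above), just so this sketch runs on its own.
@dataclass(frozen=True)
class AnnotatorModelInfo:
    model_name: str
    model_deployment: str


# What I would like ANNOTATOR_MODELS in aci_bench_annotator.py to look
# like: same judge models, but deployment names I control (placeholders
# for my own model_deployments.yaml entries).
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="openai/gpt-4o-2024-05-13",
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.3-70b-instruct",
        model_deployment="huggingface/llama-3.3-70b-instruct",
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="anthropic/claude-3-7-sonnet-20250219",
    ),
}

# No deployment should point at stanfordhealthcare/ anymore.
assert all(
    not info.model_deployment.startswith("stanfordhealthcare/")
    for info in ANNOTATOR_MODELS.values()
)
```

Editing the file in my local checkout works, but it feels like something that should be configurable without patching source.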

I have been using the following configuration and scripts to run the framework so far:

prod_env/credentials.conf

openaiApiKey: "..."
anthropicApiKey: "..."

"deployments": {
  "stanfordhealthcare/gpt-4o-2024-05-13": "...",
  "stanfordhealthcare/claude-3-7-sonnet-20250219": "...",
  "stanfordhealthcare/llama-3.3-70b-instruct": "..."
}

prod_env/model_deployments.yaml

model_deployments:
  - name: huggingface/llama-3.2-1b-instruct
    model_name: meta/llama-3.2-1b-instruct
    tokenizer_name: meta/llama-3.2-1b-instruct
    max_sequence_length: 131072
    client_spec:
      class_name: "helm.clients.huggingface_client.HuggingFaceClient"
      args:
        pretrained_model_name_or_path: meta-llama/Llama-3.2-1B-Instruct
        torch_dtype: "float16"

  - name: stanfordhealthcare/gpt-4o-2024-05-13
    model_name: openai/gpt-4.1-2025-04-14
    tokenizer_name: openai/o200k_base
    max_sequence_length: 1047576
    host_organization: openai
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"

  - name: stanfordhealthcare/claude-3-7-sonnet-20250219
    model_name: anthropic/claude-3-7-sonnet-20250219
    tokenizer_name: anthropic/claude
    max_sequence_length: 200000
    host_organization: anthropic
    client_spec:
      class_name: "helm.clients.anthropic_client.AnthropicMessagesClient"

  - name: stanfordhealthcare/llama-3.3-70b-instruct
    model_name: openai/gpt-4.1-2025-04-14
    tokenizer_name: openai/o200k_base
    max_sequence_length: 1047576
    host_organization: openai
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"
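Since a mis-indented list item in model_deployments.yaml silently changes the parsed structure, I also sanity-check that the file parses the way I expect before each run. A minimal sketch of the check (assuming PyYAML; the inline snippet mirrors one entry from my file):

```python
import yaml

# One entry mirroring the model_deployments.yaml above; parsing it
# confirms the list-item indentation yields a list of dicts under
# the model_deployments key.
SNIPPET = """
model_deployments:
  - name: stanfordhealthcare/gpt-4o-2024-05-13
    model_name: openai/gpt-4.1-2025-04-14
    tokenizer_name: openai/o200k_base
    max_sequence_length: 1047576
    host_organization: openai
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"
"""

parsed = yaml.safe_load(SNIPPET)
deployments = parsed["model_deployments"]

# Every deployment should at least carry a name and a model_name.
for entry in deployments:
    assert {"name", "model_name"} <= entry.keys()

print(deployments[0]["name"])  # stanfordhealthcare/gpt-4o-2024-05-13
```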

./reproduce_leaderboard.sh

# Pick any suite name of your choice
export SUITE_NAME=my-medhelm-suite

# Replace this with your model or models
export MODELS_TO_RUN=meta/llama-3.2-1b-instruct

# Get these from the list below
export RUN_ENTRIES_CONF_PATH=run_entries_medhelm_public.conf
export SCHEMA_PATH=schema_medhelm.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10
export PRIORITY=2

helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN

helm-summarize --schema $SCHEMA_PATH --suite $SUITE_NAME

helm-server --suite $SUITE_NAME

./run_entries_medhelm_public.conf

entries: [
  {description: "aci_bench:model=meta/llama-3.1-8b-instruct,model_deployment=huggingface/llama-3.1-8b-instruct", priority: 1},
  {description: "aci_bench:model=meta/llama-3.2-3b-instruct,model_deployment=huggingface/llama-3.2-3b-instruct", priority: 1},
  {description: "aci_bench:model=meta/llama-3.2-1b-instruct,model_deployment=huggingface/llama-3.2-1b-instruct", priority: 1},
]
