Description
I was following the MedHELM docs and the doc on reproducing the leaderboard to recreate the MedHELM leaderboard on the ACI-Bench dataset, but with local HF models such as Llama-3.1. Inference with the local models ran smoothly, but I am having a tough time getting the LLM-Jury annotators to run. helm-run keeps emitting warnings like "Could not find key 'stanfordhealthcare/llama-3.3-70b-instruct' under key 'deployments' in credentials.conf". I did write that key into credentials.conf, yet it still reports it as missing. (I wrote the anthropic and openai models in the same fashion and they ran fine, so this is perplexing.)
The more fundamental problem is that I don't see any way to select the LLM-Jury models or model deployments myself. This is frustrating because all three annotator LLMs default to the fixed OpenAI, Llama, and Anthropic models from the MedHELM paper, and on top of that they always use stanfordhealthcare/ deployments, which I don't have access to. I found that the models and model deployments for the ACI-Bench LLM jury are hard-coded in src/helm/benchmark/annotation/aci_bench_annotator.py as:
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="stanfordhealthcare/gpt-4o-2024-05-13",
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.3-70b-instruct",
        model_deployment="stanfordhealthcare/llama-3.3-70b-instruct",
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="stanfordhealthcare/claude-3-7-sonnet-20250219",
    ),
}
Is there any way I can bypass or re-declare those models and model deployments as something else?
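For reference, the only bypass I have found so far is editing that dict in my local checkout, along these lines (a minimal sketch; the openai/ and anthropic/ deployment names, and the huggingface/llama-3.2-1b-instruct deployment from my model_deployments.yaml below, are just stand-ins for deployments I can actually reach):

# Hypothetical local patch to aci_bench_annotator.py: same structure,
# but each juror points at a deployment I have credentials or hardware for.
ANNOTATOR_MODELS: Dict[str, AnnotatorModelInfo] = {
    "gpt": AnnotatorModelInfo(
        model_name="openai/gpt-4o-2024-05-13",
        model_deployment="openai/gpt-4o-2024-05-13",  # plain OpenAI instead of stanfordhealthcare/
    ),
    "llama": AnnotatorModelInfo(
        model_name="meta/llama-3.2-1b-instruct",
        model_deployment="huggingface/llama-3.2-1b-instruct",  # local HF deployment defined below
    ),
    "claude": AnnotatorModelInfo(
        model_name="anthropic/claude-3-7-sonnet-20250219",
        model_deployment="anthropic/claude-3-7-sonnet-20250219",
    ),
}

Maintaining a patched copy of the annotator just to swap the jury seems wrong, though, so I would much rather have a supported way to override this.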
Here is the configuration I have been using to run the framework so far:
prod_env/credentials.conf
openaiApiKey: "..."
anthropicApiKey: "..."
"deployments": {
"stanfordhealthcare/gpt-4o-2024-05-13": "...",
"stanfordhealthcare/claude-3-7-sonnet-20250219": "...",
"stanfordhealthcare/llama-3.3-70b-instruct": "..."
}
prod_env/model_deployments.yaml
model_deployments:
  - name: huggingface/llama-3.2-1b-instruct
    model_name: meta/llama-3.2-1b-instruct
    tokenizer_name: meta/llama-3.2-1b-instruct
    max_sequence_length: 131072
    client_spec:
      class_name: "helm.clients.huggingface_client.HuggingFaceClient"
      args:
        pretrained_model_name_or_path: meta-llama/Llama-3.2-1B-Instruct
        torch_dtype: "float16"
  - name: stanfordhealthcare/gpt-4o-2024-05-13
    model_name: openai/gpt-4.1-2025-04-14
    tokenizer_name: openai/o200k_base
    max_sequence_length: 1047576
    host_organization: openai
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"
  - name: stanfordhealthcare/claude-3-7-sonnet-20250219
    model_name: anthropic/claude-3-7-sonnet-20250219
    tokenizer_name: anthropic/claude
    max_sequence_length: 200000
    host_organization: anthropic
    client_spec:
      class_name: "helm.clients.anthropic_client.AnthropicMessagesClient"
  - name: stanfordhealthcare/llama-3.3-70b-instruct
    model_name: openai/gpt-4.1-2025-04-14
    tokenizer_name: openai/o200k_base
    max_sequence_length: 1047576
    host_organization: openai
    client_spec:
      class_name: "helm.clients.openai_client.OpenAIClient"
./reproduce_leaderboard.sh
# Pick any suite name of your choice
export SUITE_NAME=my-medhelm-suite
# Replace this with your model or models
export MODELS_TO_RUN=meta/llama-3.2-1b-instruct
# Get these from the list below
export RUN_ENTRIES_CONF_PATH=run_entries_medhelm_public.conf
export SCHEMA_PATH=schema_medhelm.yaml
export NUM_TRAIN_TRIALS=1
export MAX_EVAL_INSTANCES=10
export PRIORITY=2
helm-run --conf-paths $RUN_ENTRIES_CONF_PATH --num-train-trials $NUM_TRAIN_TRIALS --max-eval-instances $MAX_EVAL_INSTANCES --priority $PRIORITY --suite $SUITE_NAME --models-to-run $MODELS_TO_RUN
helm-summarize --schema $SCHEMA_PATH --suite $SUITE_NAME
helm-server --suite $SUITE_NAME
./run_entries_medhelm_public.conf
entries: [
    {description: "aci_bench:model=meta/llama-3.1-8b-instruct,model_deployment=huggingface/llama-3.1-8b-instruct", priority: 1},
    {description: "aci_bench:model=meta/llama-3.2-3b-instruct,model_deployment=huggingface/llama-3.2-3b-instruct", priority: 1},
    {description: "aci_bench:model=meta/llama-3.2-1b-instruct,model_deployment=huggingface/llama-3.2-1b-instruct", priority: 1},
]