Domain generalization aims to build machine learning models that perform reliably across diverse, unseen environments. A key challenge arises from spurious correlations—features that appear informative during training but fail to generalize due to their instability across different environments. For instance, a model trained to recognize birds might incorrectly rely on backgrounds (e.g., water backgrounds for waterbirds) rather than the bird features themselves, leading to poor generalization when backgrounds change.
We investigate a fundamental issue with domain generalization benchmarks: many widely-used benchmarks may be misspecified, particularly those displaying "accuracy on the line," where better in-distribution (ID) accuracy reliably predicts better out-of-distribution (OOD) accuracy. We argue that this pattern signals that these benchmarks do not effectively test the robustness of models against spurious correlations.
We define a benchmark (an in-distribution (ID)/out-of-distribution (OOD) split) as well-specified if a model relying exclusively on domain-general features (non-spurious and stable across environments) achieves better out-of-distribution generalization than a model exploiting spurious features.
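This definition can be illustrated with a minimal simulation (all numbers below are illustrative choices, not from the paper): a "core" feature that is moderately predictive in every environment, and a "spurious" feature that is strongly predictive ID but whose correlation with the label reverses OOD. A split like this is well-specified because the core-feature model transfers while the spurious-feature model collapses:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, spurious_corr):
    """Labels y in {-1,+1}. The core feature is stably informative;
    the spurious feature's sign agrees with y with prob. spurious_corr."""
    y = rng.choice([-1, 1], size=n)
    core = y * 0.8 + rng.normal(0, 1, n)                         # stable, moderate
    agree = rng.random(n) < spurious_corr
    spur = np.where(agree, y, -y) * 2.0 + rng.normal(0, 0.5, n)  # strong but unstable
    return core, spur, y

# ID: spurious feature agrees with the label 95% of the time; OOD: reversed (10%).
core_id, spur_id, y_id = sample(20000, 0.95)
core_ood, spur_ood, y_ood = sample(20000, 0.10)

acc = lambda feat, y: float(np.mean(np.sign(feat) == y))

print("core-only  ID/OOD:", acc(core_id, y_id), acc(core_ood, y_ood))  # roughly equal
print("spurious   ID/OOD:", acc(spur_id, y_id), acc(spur_ood, y_ood))  # collapses OOD
```

The spurious-feature classifier wins ID (~0.95 vs ~0.79) but loses badly OOD, so a domain-general model generalizes better out-of-distribution, satisfying the well-specified condition.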
Our simulation and real-world data results can be explored via an interactive app here.
Our findings suggest crucial considerations for the research community:
- Benchmark Selection: Researchers should prioritize benchmarks without accuracy on the line when evaluating domain generalization.
- Evaluation Practices: Averaging results across multiple ID/OOD splits can obscure meaningful insights, especially if only certain splits are well-specified.
- Model Selection: Selecting models based solely on held-out accuracy may unintentionally reinforce reliance on spurious correlations.
Our work highlights the importance of critically re-evaluating and refining benchmarks to ensure they accurately measure robustness to spurious correlations, thereby paving a clearer path toward reliable model generalization under strong distribution shifts, e.g., a pandemic.
Results 1. Sufficient spurious correlation reversal gives well-specified domain generalization ID/OOD splits.
Our primary theoretical result (Theorem 1) introduces the concept of a negative margin under distribution shift, also known as spurious correlation reversal. Informally, this margin quantifies how drastically the correlation between spurious features and labels must reverse from the training distribution to the testing distribution for the ID/OOD split to be well-specified. We assume spurious features are those whose correlation with the label can change across environments; the formal conditions are stated in the paper.
Results 2. Sufficient spurious correlation reversal (i.e., a well-specified split) and accuracy on the line are at odds: they simultaneously hold with probability 0. Accuracy on the line occurs when there is a strong linear correlation between in-distribution and out-of-distribution accuracy across various models. Mathematically, for classifiers $f$, accuracy on the line holds (up to a tolerance $\varepsilon$) when

$$\Phi^{-1}(\mathrm{acc}_{\mathrm{OOD}}(f)) = a\,\Phi^{-1}(\mathrm{acc}_{\mathrm{ID}}(f)) + b,$$

where $\Phi^{-1}$ is the probit transform and the slope $a$ and intercept $b$ are shared across models.
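Whether a given split exhibits accuracy on the line can be checked empirically by fitting a line to probit-transformed (ID, OOD) accuracy pairs across a pool of models. A minimal sketch, using the standard-library `NormalDist` for the probit transform and made-up accuracy values:

```python
import numpy as np
from statistics import NormalDist

# Probit transform: inverse CDF of the standard normal.
probit = np.vectorize(NormalDist().inv_cdf)

# Hypothetical (ID, OOD) accuracy pairs for a pool of models on one split.
id_acc  = np.array([0.62, 0.70, 0.78, 0.85, 0.91])
ood_acc = np.array([0.55, 0.63, 0.72, 0.80, 0.88])

x, y = probit(id_acc), probit(ood_acc)
slope, intercept = np.polyfit(x, y, 1)   # least-squares line in probit space
r = np.corrcoef(x, y)[0, 1]              # Pearson correlation

print(f"slope={slope:.2f} intercept={intercept:.2f} R={r:.2f}")
```

A Pearson R close to 1 on the probit scale, as in these illustrative numbers, indicates accuracy on the line and, by our results, a potentially misspecified split.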
Our analysis shows a fundamental trade-off (Theorem 2):

- Define $W_\varepsilon$ as the set of shifts ($\phi$'s) satisfying both the well-specified condition and accuracy on the line (for $\varepsilon$) simultaneously. Then $\Pr(\phi \in W_\varepsilon) \to 0$ as $\varepsilon \to 0$.

Intuitively, as accuracy on the line becomes increasingly exact ($\varepsilon \to 0$), the probability that a shift is simultaneously well-specified vanishes.
Figure 1: Simulated results for Gaussian Spurious Features (left) and Sub-Gaussian Spurious Features (right), illustrating conditions for well-specified vs. misspecified benchmarks. Generally, without sufficient spurious correlation reversal, the ID/OOD split is misspecified.
Figure 2: ID vs. OOD accuracy on the probit scale.
- 3a. ColoredMNIST (Env 2, R=-0.74)
- 3b. Covid-CXR (Env 3, R=-0.48)
- 3c. CivilComments (Env 1, R=-0.47)
- 3d. Waterbirds (Env 0, R=-0.06)
- 3e. Camelyon (Env 2, R=0.78)
- 3f. PACS (Env 0, R=0.98)
Figure 3: Real-world benchmarks exhibiting varying degrees of accuracy correlation, illustrating well-specified (negative or weak correlation) versus potentially misspecified (strong positive correlation) scenarios. We show some ID/OOD splits of popular domain-generalization benchmarks with a strong positive, weak, or strong negative correlation between in-distribution and out-of-distribution accuracy. Our results suggest that algorithms that consistently provide models with the best transfer accuracies for these splits are at least partially successful in removing spurious correlations.
In Figure 3, we analyze real-world benchmarks, showcasing varying correlation patterns:
- ColoredMNIST (Fig 3a): Strong negative correlation (R=-0.74) when environment 2 is OOD.
- Covid-CXR (Fig 3b): Moderate negative correlation (R=-0.48) when environment 3 is OOD.
- CivilComments (Fig 3c): Moderate negative correlation (R=-0.47) when environment 1 is OOD.
- Waterbirds (Fig 3d): Weak negative correlation (R=-0.06) when the environment defined by the group (y=0, a=1) is OOD.
- Camelyon (Fig 3e): Moderate positive correlation (R=0.78) when environment 2 is OOD.
- PACS (Fig 3f): Strong positive correlation (R=0.98) when environment 0 is OOD.
These empirical examples illustrate how benchmarks vary from well-specified (negative or weak correlations) to potentially misspecified (strong positive correlations). Our analysis indicates that the PACS benchmark, with its strong positive correlation (R=0.98), is likely misspecified for evaluating domain generalization. Camelyon represents an interesting intermediate case: it exhibits strong positive accuracy on the line within a certain accuracy range, but this pattern disappears among higher-accuracy models. This suggests that models primarily rely on spurious correlations after they reach sufficiently high accuracy. Such nuances emphasize the importance of qualitative analyses when selecting ID/OOD splits, as the Camelyon split may still serve as a meaningful benchmark for assessing domain generalization under high-accuracy conditions.

Conversely, other benchmarks showing weak or strongly negative correlations, like ColoredMNIST, Covid-CXR, CivilComments, and Waterbirds, clearly satisfy our criteria for well-specified domain generalization benchmarks.

The Waterbirds benchmark represents a special case of domain generalization known as subpopulation shift. In this scenario, we find that the worst-group accuracy—the primary metric targeted by methods designed to handle subpopulation shifts—does not exhibit accuracy on the line, especially when the worst-performing group is drawn from an out-of-distribution (OOD) environment. This observation highlights the appropriateness of worst-group accuracy as a robust evaluation metric under meaningful distribution shifts.
Table 1. Summary of ID/OOD accuracy correlations across multiple benchmarks. Importantly, these correlations are reported at the granularity of individual ID/OOD splits rather than averaged across entire benchmarks or datasets. We emphasize that evaluating performance at this granularity—specific ID/OOD splits—is crucial for effectively assessing domain generalization. For the Camelyon dataset, we find that, qualitatively, these ID/OOD splits may be well-suited for benchmarking domain generalization, as previously discussed (see Figure 4). More splits can be found in our paper.
| OOD | Slope | Intercept | Pearson R | p-value | Std. Error |
|---|---|---|---|---|---|
| Spawrious One-to-One Hard: Env 0 acc | 0.32 | -0.21 | 0.50 | 0.00 | 0.05 |
| Spawrious Many-to-Many Hard: Env 0 acc | 0.16 | -0.04 | 0.29 | 0.00 | 0.01 |
| Covid-CXR: Env 1 acc | -0.38 | 0.13 | -0.50 | 0.00 | 0.02 |
| Covid-CXR: Env 3 acc | -0.60 | 0.56 | -0.48 | 0.00 | 0.03 |
| Covid-CXR: Env 4 acc | 0.53 | -0.04 | 0.31 | 0.00 | 0.04 |
| Waterbirds: Env 1 avg acc & Env 0 y=0,a=1 acc | -0.07 | 1.61 | -0.11 | 0.00 | 0.02 |
| WILDSCamelyon: Env 0 acc | 0.78 | 0.33 | 0.90 | 0.00 | 0.01 |
| WILDSCamelyon: Env 2 acc | 0.62 | 0.49 | 0.78 | 0.00 | 0.01 |
| WILDSCamelyon: Env 4 acc | 0.63 | 0.40 | 0.78 | 0.00 | 0.01 |
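The per-split statistics in Table 1 (slope, intercept, Pearson R, p-value, standard error) are exactly what `scipy.stats.linregress` reports for a set of per-model (ID, OOD) accuracy pairs. A minimal sketch with hypothetical accuracies for one reversal-style split (the real values come from the sweep results):

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical per-model accuracies for a single ID/OOD split.
id_acc  = np.array([0.61, 0.67, 0.72, 0.80, 0.86, 0.90])
ood_acc = np.array([0.70, 0.66, 0.58, 0.52, 0.47, 0.41])  # reversal-style split

fit = linregress(id_acc, ood_acc)
print(f"Slope={fit.slope:.2f} Intercept={fit.intercept:.2f} "
      f"R={fit.rvalue:.2f} p={fit.pvalue:.3f} StdErr={fit.stderr:.2f}")
```

A strongly negative R, as here, would indicate spurious correlation reversal and hence a well-specified split; computing this per split, rather than averaged over a dataset, is the granularity Table 1 advocates.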
Figure 4: ID/OOD splits for WILDSCamelyon dataset.
See ./valid_domain_generalization_benchmarks/README.md for details. To launch the interactive app:

```shell
cd valid_domain_generalization_benchmarks; streamlit run main.py
```
Our experiments are conducted using a modification of the DomainBed library. Notably, our modification includes additional datasets and model architectures beyond the original DomainBed library, and the model architectures also support transfer learning and fine-tuning.
Datasets:
- WILDSCamelyon (Bandi et al., 2018; Koh et al.,2021)
- CivilComments (Borkan et al., 2019; Koh et al., 2021)
- ColoredMNIST (Arjovsky et al., 2019; Gulrajani & Lopez-Paz, 2020a)
- Covid-CXR (Alzate-Grisales et al., 2022; Cohen et al., 2020b; Tabik et al., 2020; Tahir et al., 2021; Suwalska et al., 2023)
- WILDSFMoW (Christie et al., 2018; Koh et al., 2021)
- PACS (Li et al., 2017; Gulrajani & Lopez-Paz, 2020a)
- Spawrious (Lynch et al., 2023)
- TerraIncognita (Beery et al., 2018; Gulrajani & Lopez-Paz, 2020a)
- Waterbirds (Sagawa et al., 2019)
Model Architectures:
- ResNet-18/50 (He et al., 2016)
- DenseNet-121 (Huang et al., 2017)
- Vision Transformers (Dosovitskiy et al., 2020)
- ConvNeXt-Tiny (Liu et al., 2022)
Our results only include accuracies for the 'out' split of each domain, and we report two versions of each plot:

- x-axis: individual source-domain accuracies; y-axis: individual target-domain accuracy
- x-axis: average source-domain accuracy; y-axis: individual target-domain accuracy
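The two plot versions differ only in how source-domain accuracies are aggregated into x-coordinates. A minimal sketch with hypothetical per-run accuracies (the field names `sources` and `target` are illustrative, not the repository's actual result schema):

```python
import numpy as np

# Hypothetical per-run results: accuracies on the 'out' split of each
# source domain, plus the held-out target domain's accuracy.
runs = [
    {"sources": [0.82, 0.79, 0.85], "target": 0.64},
    {"sources": [0.88, 0.84, 0.90], "target": 0.58},
]

# Version 1: each source-domain accuracy paired individually with the target.
pairs_individual = [(s, r["target"]) for r in runs for s in r["sources"]]

# Version 2: sources averaged per run, giving one point per run.
pairs_averaged = [(float(np.mean(r["sources"])), r["target"]) for r in runs]

print(pairs_individual)
print(pairs_averaged)
```

Version 1 yields one scatter point per (source domain, run) pair, while version 2 collapses each run to a single point, which smooths out per-domain variation in the ID axis.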
To run the experiments, use the following command:

```shell
python sweep.py --datasets <dataset_names> --algorithms <algorithm_names> --n_hparams <n_hparams> --n_trials <n_trials> --model_arch <model_arch>
```

Example:

```shell
python sweep.py --datasets TerraIncognita --algorithms ERM --n_hparams 25 --n_trials 1 --model_arch vit_b_16
```