This repository accompanies the paper Salaudeen, Olawale, et al. "Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?." arXiv preprint arXiv:2504.00186 (2025).
We provide:
- Simulations & Plots that illustrate when and why common OOD benchmarks fail to detect reliance on spurious correlations.
- A Streamlit app (`main.py`) for interactive exploration of accuracy-on-the-line patterns under different simulated shifts: https://misspecified-dg-benchmarks-viz.streamlit.app/
Read the full paper on arXiv: https://arxiv.org/abs/2504.00186
- Simulation Mode (`/simulation_mode`): Generate toy domain splits under controlled shifts and visualize correlation patterns. Users can either mimic their expected spurious-correlation structure to produce and examine “accuracy on the line,” or interactively explore different spurious-correlation structures that yield ID vs. OOD accuracy patterns similar to those observed in real benchmarks. This mode helps users determine whether their benchmark or OOD generalization task is misspecified.
- Plotting Mode (`/plotting_mode`): Load real benchmark data (e.g., PACS, VLCS, Waterbirds) to reproduce “accuracy on the line” plots.
- Interactive App: Tune parameters (spurious-/domain-general-signal strength, shift severity, etc.) in real time and observe how the Pearson correlation between ID and OOD accuracy changes.
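The effect Simulation Mode explores can be sketched in a few lines of NumPy (a toy illustration, not the app's actual simulation code): a binary task with one domain-general feature and one spurious feature whose alignment with the label flips OOD. Linear classifiers that lean harder on the spurious feature gain ID accuracy but lose OOD accuracy, so ID and OOD accuracies end up negatively correlated rather than “on the line.” The feature construction and strength values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(n, spurious_strength):
    """Toy binary task: a stable (domain-general) feature plus a spurious
    feature whose alignment with the label depends on the domain."""
    y = rng.integers(0, 2, n) * 2 - 1                      # labels in {-1, +1}
    stable = y + rng.normal(0.0, 1.0, n)                   # stable signal
    spurious = spurious_strength * y + rng.normal(0.0, 1.0, n)
    return np.stack([stable, spurious], axis=1), y

def accuracy(w, X, y):
    return np.mean(np.sign(X @ w) == y)

# ID: spurious feature aligned with the label; OOD: alignment reversed.
X_id, y_id = make_domain(5000, spurious_strength=1.0)
X_ood, y_ood = make_domain(5000, spurious_strength=-0.5)

# "Models" = linear classifiers with varying reliance on the spurious feature.
id_accs, ood_accs = [], []
for alpha in np.linspace(0.0, 2.0, 21):
    w = np.array([1.0, alpha])                             # weight on spurious
    id_accs.append(accuracy(w, X_id, y_id))
    ood_accs.append(accuracy(w, X_ood, y_ood))

r = np.corrcoef(id_accs, ood_accs)[0, 1]
print(f"Pearson R (ID vs. OOD accuracy): {r:.2f}")        # strongly negative
```

Under this shift, more spurious reliance helps ID but hurts OOD, which is exactly the failure pattern a benchmark with accuracy on the line cannot surface.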
- Clone this repository: `git clone https://github.com/olawalesalaudeen/misspecified_DG_benchmarks_viz.git`, then `cd misspecified_DG_benchmarks_viz`.
- Create a virtual environment with `python=3.9` and install packages using `pip install -r requirements.txt`.
- Run `streamlit run main.py`.