LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers

Shantanu Ghosh¹, Rayan Syed¹, Chenyu Wang¹, Vaibhav Choudhary¹, Binxu Li², Clare B. Poynton³, Shyam Visweswaran⁴, Kayhan Batmanghelich¹

¹BU ECE, ² Stanford University, ³ BUMC, ⁴ Pitt DBMI

📚 Table of Contents

TL;DR
Highlights
Warnings
Acknowledgements
Environment Setup
Dataset Zoo
Model Zoo
Downloading Classifier Checkpoints
Vision-Language Representation Space
Generating Captions
- For Natural Images
- For Medical Images
LADDER Pipeline
Demo Notebooks With Qualitative Results
Scripts
Citation
License
Contact
Contributing

📌 TL;DR

LADDER is a modular framework that uses large language models (LLMs) to discover, explain, and mitigate hidden biases in vision classifiers—without requiring prior knowledge of the biases or attribute labels.

🚨 Highlights

📊 6 Datasets Evaluated
- 🐦 Natural Images: Waterbirds, CelebA, MetaShift
- 🏥 Medical Imaging: NIH ChestX-ray, RSNA-Mammo, VinDr-Mammo
🧪 ~20 Bias Mitigation Algorithms Benchmarked
- 💡 ERM, GroupDRO, CVaR-DRO, JTT, LfF, DFR
- 🧬 CORAL, IRM, V-REx, IB-IRM, Reweighting, Mixup, AugMix
🧠 11 Architectures Across 5 Pretraining Strategies
- 🧱 CNNs: ResNet-50, EfficientNet-B5
- 🔲 ViTs: ViT-B/16, ViT-S/16
- 🧪 Pretrained With: SimCLR, Barlow Twins, DINO, CLIP (OpenAI),
  IN1K, IN21K, SWAG, LAION-2B
💬 4 LLMs for Hypothesis Generation
- 🧠 GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA 3.1 70B
Star 🌟 us if you think it is helpful!!

⚠️ Warnings

🔧 Replace all hardcoded paths like /restricted/projectnb/batmanlab/shawn24/PhD with your own directory.

Following the [MIMIC-CXR](https://physionet.org/news/post/gpt-responsible-use) guidelines, we set up Google Vertex AI to use Gemini as the LLM for hypothesis generation on medical images in this codebase.

📅 All LLMs were evaluated using checkpoints available before Jan 11, 2025.
Newer versions may produce different hypotheses than those reported in the paper.

🧠 Default setup uses:

- GPT-4o as the captioner for natural images
- ResNet-50 as the classifier
- ViT-B/32 for the vision-language representation space
- GPT-4o for hypothesis generation

The code is modular and can be easily extended to other models and LLMs.

🗂️ Update the cache directory locations in save_img_reps.py to your own:
os.environ['TRANSFORMERS_CACHE'] = '/your/custom/.cache/huggingface/transformers'
os.environ['TORCH_HOME'] = '/your/custom/.cache/torch'
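
One way to avoid editing both lines by hand is to derive them from a single root. The following is a minimal sketch, not code from the repo: `configure_caches` and the `LADDER_CACHE_ROOT` environment variable are hypothetical names, and the sketch assumes the variables are set before `torch` or `transformers` is imported.

```python
import os
from pathlib import Path


def configure_caches(root):
    """Point the Hugging Face and PyTorch caches at sub-directories of `root`."""
    root = Path(os.path.expanduser(str(root)))
    caches = {
        "TRANSFORMERS_CACHE": root / "huggingface" / "transformers",
        "TORCH_HOME": root / "torch",
    }
    for name, path in caches.items():
        os.environ[name] = str(path)
    return caches


# LADDER_CACHE_ROOT is a hypothetical variable name used only in this sketch.
cache_paths = configure_caches(os.environ.get("LADDER_CACHE_ROOT", "~/.cache"))

# Import torch / transformers only after this point so they see the new values.
```

Setting the variables in one place keeps the two cache locations consistent when the project is moved to another machine.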

🙏 Acknowledgements

We rely heavily on the Subpopulation Shift Benchmark (SubpopBench) codebase for:

📥 Downloading and processing datasets
🧠 Classifier training on natural image benchmarks
Note: SubpopBench does not support the NIH ChestX-ray (NIH-CXR) dataset.
Our codebase therefore adds experiments for NIH-CXR, which are discussed in subsequent sections. The modifications to SubpopBench needed for compatibility are included in this repo under src/codebase/SubpopBench-main.

🛠️ Environment Setup

Create the conda environment from environment.yaml:

git clone git@github.com:batmanlab/Ladder.git
cd Ladder
conda env create --name Ladder -f environment.yaml
conda activate Ladder

📚 Dataset zoo

Refer to dataset_zoo.md for details of the datasets used in this project. For the toy dataset in Fig. 1, run the Python script.

🧠 Model zoo

For details of the classifiers, pretraining methods, and algorithms supported by this codebase, refer to classifier_zoo.md.

💾 Downloading Classifier Checkpoints Used in the Paper

We provide the pretrained ResNet-50 (resnet_sup_in1k) and EfficientNet-B5 (tf_efficientnet_b5_ns-detect) classifier checkpoints used in our experiments via Hugging Face Hub.

📦 Available Checkpoints by Dataset:

🤖 Vision-language representation space

We use the following vision-language representation space for our experiments:

Natural images: CLIP
Mammograms: Mammo-CLIP
Chest X-rays: CXR-CLIP

Download the latest checkpoints from the respective repositories.

💬 Generating captions

🌄 For Natural Images

LADDER requires captions for the images in the validation dataset. We provide scripts to generate them with BLIP and GPT-4o. You can either download the captions from the respective dataset directory on Hugging Face or generate them with the following scripts.

Using BLIP

python ./src/codebase/caption_images.py \
  --seed=0 \
  --dataset="Waterbirds" \
  --img-path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data/waterbirds/waterbird_complete95_forest2water2" \
  --csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data/waterbirds/metadata_waterbirds.csv" \
  --save_csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data/waterbirds/va_metadata_waterbirds_captioning_blip.csv" \
  --split="va" \
  --captioner="blip"

Using GPT-4o

python ./src/codebase/caption_images_gpt_4.py \
  --seed=0 \
  --dataset="Waterbirds" \
  --img-path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data/waterbirds/waterbird_complete95_forest2water2" \
  --csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data/waterbirds/metadata_waterbirds.csv" \
  --save_csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data/waterbirds/va_metadata_waterbirds_captioning_GPT.csv" \
  --split="va" \
  --model="gpt-4o" \
  --api_key="<open-ai key>"

🫁 For Medical Images

For NIH-CXR, we use the radiology reports from the MIMIC-CXR dataset. Download the metadata CSV containing the impression and findings sections from here.
For RSNA-Mammo and VinDr-Mammo, we use the radiology text from the Mammo-FActOR codebase.

🪜 LADDER Pipeline

The LADDER pipeline consists of 6 steps. We uploaded the outputs of every step to Hugging Face. The steps are as follows:

🔁 Pipeline Overview

Step 1: Save the image representations from the image classifier and from the vision encoder of the vision-language representation space

python ./src/codebase/save_img_reps.py \
  --seed=0 \
  --dataset="Waterbirds" \
  --classifier="resnet_sup_in1k" \
  --classifier_check_pt="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/model.pkl" \
  --flattening-type="adaptive" \
  --clip_vision_encoder="ViT-B/32" \
  --data_dir="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data" \
  --save_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}"
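
The commands in this section repeat the same run directory many times. The following is a minimal sketch of building it once from a root. It assumes that the `{}` in the paths above is a Python `str.format` placeholder that the scripts fill with the seed, which is an inference from the path layout and not something this README confirms. `LADDER_ROOT` and `run_dir_template` are hypothetical names.

```python
from pathlib import PurePosixPath

# Hypothetical root; replace with the directory that holds your clone.
LADDER_ROOT = PurePosixPath("/your/path/to/Ladder")


def run_dir_template(root, dataset, classifier, algorithm="ERM", hparams=0):
    """Return the run directory with the seed slot left open as '{}'."""
    return str(
        PurePosixPath(root) / "out" / dataset / f"{classifier}_attrNo"
        / f"{dataset}_{algorithm}_hparams{hparams}_seed{{}}"
    )


template = run_dir_template(LADDER_ROOT, "Waterbirds", "resnet_sup_in1k")
seed0_dir = template.format(0)  # assumes the scripts call str.format with the seed
print(template)   # seed slot still open
print(seed0_dir)  # seed slot filled with 0
```

The template string can be passed to `--save_path` and `--classifier_check_pt` (with `/model.pkl` appended) in place of the hardcoded cluster path.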

Step 2: Save the text representations from the text encoder of the vision-language representation space

python ./src/codebase/save_text_reps.py \
  --seed=0 \
  --dataset="Waterbirds" \
  --clip_vision_encoder="ViT-B/32" \
  --save_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}" \
  --prompt_sent_type="captioning" \
  --captioning_type="gpt-4o" \
  --prompt_csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/data/waterbirds/va_metadata_waterbirds_captioning_GPT.csv"

Step 3: Train the aligner to align the classifier's image representations with the vision-language image representations

python ./src/codebase/learn_aligner.py \
  --seed=0 \
  --epochs=30 \
  --dataset="Waterbirds" \
  --save_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{0}/clip_img_encoder_ViT-B/32" \
  --clf_reps_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{0}/clip_img_encoder_ViT-B/32/{1}_classifier_embeddings.npy" \
  --clip_reps_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{0}/clip_img_encoder_ViT-B/32/{1}_clip_embeddings.npy"
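
This README does not describe the aligner's architecture or loss. The following is a minimal sketch of the general idea, assuming a single linear layer trained with mean squared error to map classifier embeddings into the vision-language image space. Plain Python lists stand in for the `.npy` arrays, the data is synthetic, and all function names are hypothetical.

```python
import random


def apply_aligner(W, b, x):
    """Project one classifier embedding x into the vision-language space."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


def mean_squared_error(W, b, xs, ys):
    """Average squared error over all samples and output dimensions."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += sum((p - t) ** 2 for p, t in zip(apply_aligner(W, b, x), y))
    return total / (len(xs) * len(ys[0]))


def train_linear_aligner(clf_reps, clip_reps, epochs=30, lr=0.1, seed=0):
    """Fit W, b by per-sample gradient descent on the squared error."""
    rng = random.Random(seed)
    d_in, d_out = len(clf_reps[0]), len(clip_reps[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    b = [0.0] * d_out
    for _ in range(epochs):
        for x, y in zip(clf_reps, clip_reps):
            pred = apply_aligner(W, b, x)
            for j in range(d_out):
                err = pred[j] - y[j]
                b[j] -= lr * err
                for i in range(d_in):
                    W[j][i] -= lr * err * x[i]
    return W, b


# Synthetic stand-ins: 3-d "classifier" embeddings and 2-d "CLIP" embeddings.
data_rng = random.Random(0)
true_W = [[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]]
clf_reps = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
clip_reps = [apply_aligner(true_W, [0.0, 0.0], x) for x in clf_reps]

W0, b0 = train_linear_aligner(clf_reps, clip_reps, epochs=0)  # untrained weights
W, b = train_linear_aligner(clf_reps, clip_reps, epochs=30)
loss_before = mean_squared_error(W0, b0, clf_reps, clip_reps)
loss_after = mean_squared_error(W, b, clf_reps, clip_reps)
print(loss_before, loss_after)
```

Once trained, the projection lets classifier embeddings be compared directly with text embeddings from the same vision-language space, which is what the retrieval in the next step relies on.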

Step 4: Retrieve sentences indicative of biases

python ./src/codebase/discover_error_slices.py \
  --seed=0 \
  --topKsent=200 \
  --dataset="Waterbirds" \
  --save_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32" \
  --clf_results_csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/test_additional_info.csv" \
  --clf_image_emb_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/test_classifier_embeddings.npy" \
  --language_emb_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/sent_emb_captions_gpt-4o.npy" \
  --sent_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/sentences_captions_gpt-4o.pkl" \
  --aligner_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/aligner_30.pth"
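
The retrieval logic is not spelled out in this README. The following is a minimal sketch under one assumption suggested by the output file name `..._sent_diff_emb.txt`: sentences are ranked by cosine similarity to the difference between the mean aligned embedding of correctly classified images and that of misclassified images. The direction of the difference, the toy 2-d embeddings, and the function names are all assumptions made for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_sentences(correct_embs, error_embs, sent_embs, sentences, k):
    """Rank sentences by similarity to (mean correct - mean misclassified)."""
    diff = [c - e for c, e in zip(mean_vector(correct_embs), mean_vector(error_embs))]
    ranked = sorted(
        zip(sentences, sent_embs),
        key=lambda pair: cosine(diff, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


# Toy embeddings: the first axis loosely stands for "water in the background".
correct_embs = [[1.0, 0.1], [0.9, 0.0]]
error_embs = [[0.1, 0.2], [0.0, 0.1]]
sentences = ["a bird standing on water", "a bird on a branch", "a bird in a forest"]
sent_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]

top = top_k_sentences(correct_embs, error_embs, sent_embs, sentences, k=2)
print(top)
```

In this toy setup the sentence mentioning water ranks first because it points in the direction that separates the correctly classified images from the misclassified ones.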

Step 5: Discover error slices via LLM-driven hypothesis generation

python ./src/codebase/validate_error_slices_w_LLM.py \
  --seed=0 \
  --LLM="gpt-4o" \
  --dataset="Waterbirds" \
  --class_label="landbirds" \
  --clip_vision_encoder="ViT-B/32" \
  --key="<open-ai key>" \
  --top50-err-text="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/landbirds_error_top_200_sent_diff_emb.txt" \
  --save_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32" \
  --clf_results_csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/{}_additional_info.csv" \
  --clf_image_emb_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/{}_classifier_embeddings.npy" \
  --aligner_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/aligner_30.pth"
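
The exact prompt used by `validate_error_slices_w_LLM.py` is not shown in this README. The sketch below only illustrates the general idea of turning retrieved sentences into a request for testable hypotheses. The prompt wording and the function name `build_hypothesis_prompt` are hypothetical, and no API call is made.

```python
def build_hypothesis_prompt(class_label, sentences, max_sentences=None):
    """Format retrieved sentences into one prompt string (wording is illustrative)."""
    selected = sentences[:max_sentences] if max_sentences else sentences
    numbered = "\n".join(f"{i}. {s.strip()}" for i, s in enumerate(selected, start=1))
    return (
        f"A vision classifier makes systematic mistakes on the class '{class_label}'.\n"
        "The sentences below describe images it classifies correctly but not the "
        "images it misclassifies:\n"
        f"{numbered}\n"
        "List the visual attributes the classifier may be relying on, "
        "one hypothesis per line."
    )


prompt = build_hypothesis_prompt(
    "landbirds",
    ["a bird perched on a bamboo stem ", "a small bird in a forest"],
)
print(prompt)
```

The resulting string would then be sent to whichever LLM is selected with `--LLM`, using that provider's own client library.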

Step 6: Mitigate multiple biases without annotation

python ./src/codebase/mitigate_error_slices.py \
  --seed=0 \
  --epochs=9 \
  --lr=0.001 \
  --weight_decay=0.0001 \
  --n=600 \
  --mode="last_layer_finetune" \
  --dataset="Waterbirds" \
  --classifier="resnet_sup_in1k" \
  --slice_names="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/{}_prompt_dict.pkl" \
  --classifier_check_pt="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/model.pkl" \
  --save_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32" \
  --clf_results_csv="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/{}_{}_dataframe_mitigation.csv" \
  --clf_image_emb_path="/restricted/projectnb/batmanlab/shawn24/PhD/Ladder/out/Waterbirds/resnet_sup_in1k_attrNo/Waterbirds_ERM_hparams0_seed{}/clip_img_encoder_ViT-B/32/{}_classifier_embeddings.npy"
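
This README does not document how the `last_layer_finetune` mode selects its training data. A common recipe for last-layer retraining is to draw a group-balanced subset, here using pseudo-labels instead of annotated attributes. The following is a minimal sketch of that sampling step only. `balanced_indices` and `n_per_group` are hypothetical names and are not claimed to correspond to the `--n` flag above.

```python
import random
from collections import defaultdict


def balanced_indices(slice_labels, n_per_group, seed=0):
    """Sample up to n_per_group indices from each pseudo-labelled group."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, label in enumerate(slice_labels):
        by_group[label].append(idx)
    chosen = []
    for label in sorted(by_group):
        members = by_group[label]
        # Small groups contribute all their members; large ones are subsampled.
        chosen.extend(rng.sample(members, min(n_per_group, len(members))))
    rng.shuffle(chosen)
    return chosen


# Toy pseudo-labels for one hypothesis: 8 images with the attribute, 2 without.
labels = ["water_background"] * 8 + ["no_water_background"] * 2
subset = balanced_indices(labels, n_per_group=2)
print(sorted(subset))
```

Training only the final layer on such a subset keeps the minority slice from being swamped by the majority slice, without touching the frozen feature extractor.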

➡️ Demo Notebooks With Qualitative Results

Refer to notebook1 and notebook2 for qualitative results of the error slices discovered by LADDER.

📜 Scripts to replicate the experiments of Ladder pipeline

We provide runnable shell scripts to replicate the full LADDER pipeline across all datasets:

📖 Citation

If you find this work useful, please cite our paper:

@inproceedings{ghosh-etal-2025-ladder,
    title = "{LADDER}: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers",
    author = "Ghosh, Shantanu  and
      Syed, Rayan  and
      Wang, Chenyu  and
      Choudhary, Vaibhav  and
      Li, Binxu  and
      Poynton, Clare B  and
      Visweswaran, Shyam  and
      Batmanghelich, Kayhan",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.1177/",
    pages = "22935--22970",
    ISBN = "979-8-89176-256-5",
    abstract = "Slice discovery refers to identifying systematic biases in the mistakes of pre-trained vision models. Current slice discovery methods in computer vision rely on converting input images into sets of attributes and then testing hypotheses about configurations of these pre-computed attributes associated with elevated error patterns. However, such methods face several limitations: 1) they are restricted by the predefined attribute bank; 2) they lack the \textit{common sense} reasoning and domain-specific knowledge often required for specialized fields radiology; 3) at best, they can only identify biases in image attributes while overlooking those introduced during preprocessing or data preparation. We hypothesize that bias-inducing variables leave traces in the form of language (logs), which can be captured as unstructured text. Thus, we introduce ladder, which leverages the reasoning capabilities and latent domain knowledge of Large Language Models (LLMs) to generate hypotheses about these mistakes. Specifically, we project the internal activations of a pre-trained model into text using a retrieval approach and prompt the LLM to propose potential bias hypotheses. To detect biases from preprocessing pipelines, we convert the preprocessing data into text and prompt the LLM. Finally, ladder generates pseudo-labels for each identified bias, thereby mitigating all biases without requiring expensive attribute annotations.Rigorous evaluations on 3 natural and 3 medical imaging datasets, 200+ classifiers, and 4 LLMs with varied architectures and pretraining strategies {--} demonstrate that ladder consistently outperforms current methods. Code is available: \url{https://github.com/batmanlab/Ladder}."
}

License and copyright

Licensed under the Creative Commons Attribution 4.0 International license.

Contact

For any queries, contact Shantanu Ghosh (email: shawn24@bu.edu)

Contributing

Did you try some other classifier on a new dataset and want to report the results? Feel free to send a pull request.
