git clone https://github.com/cvlab-stonybrook/ZoomLDM/
conda activate zoomldm
pip install -r requirements.txt
The model weights are hosted on Hugging Face. The inference scripts below download them automatically via huggingface_hub.
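If you prefer to fetch the checkpoints yourself, a minimal sketch using huggingface_hub is shown below; the repo id is a placeholder, and the actual id is set inside the notebooks.

from huggingface_hub import snapshot_download

# Download every file in the model repository into the local HF cache and return its path.
# NOTE: "<hf-user>/ZoomLDM" is a placeholder repo id; use the one referenced in the notebooks.
ckpt_dir = snapshot_download(repo_id="<hf-user>/ZoomLDM")
print(ckpt_dir)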

We demonstrate patch-level generation at any scale in sample_patches_brca.ipynb and sample_patches_naip.ipynb.


For large image generation, we use the proposed joint multi-scale sampling algorithm.
We provide an implementation of the algorithm in joint_multiscale.ipynb.
You can find more examples of large images here.
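As a rough illustration of the idea (a schematic sketch, not the repository's implementation): joint multi-scale sampling keeps the scales consistent during denoising, so the high-magnification tile estimates, once stitched and downsampled, should agree with the low-magnification estimate. The tensor shapes, tile layout, and the simple additive correction below are assumptions made for illustration; see joint_multiscale.ipynb for the actual algorithm.

import torch
import torch.nn.functional as F

def multiscale_consistency(x0_low, x0_high_tiles, grid, weight=1.0):
    # x0_low:        (1, C, H, W) clean-latent estimate at the low magnification.
    # x0_high_tiles: (grid*grid, C, H, W) clean-latent estimates of the high-magnification
    #                tiles, ordered row-major over the large image.
    n, c, h, w = x0_high_tiles.shape
    # Stitch the tiles into one large latent...
    stitched = (x0_high_tiles.reshape(grid, grid, c, h, w)
                .permute(2, 0, 3, 1, 4)
                .reshape(1, c, grid * h, grid * w))
    # ...and downsample it to the low-magnification resolution.
    down = F.interpolate(stitched, size=x0_low.shape[-2:], mode="bilinear", align_corners=False)
    # Nudge the low-magnification estimate towards the stitched high-magnification view.
    return x0_low + weight * (down - x0_low)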


Super-resolution uses the condition inversion algorithm proposed in the paper, together with joint multi-scale sampling to enforce the low-resolution constraint.
We provide an implementation in superres.ipynb.
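At a high level, condition inversion searches for a conditioning embedding under which the diffusion model best explains the given low-resolution input; that embedding is then reused when sampling at higher magnification. The sketch below is illustrative only: it assumes the CompVis latent-diffusion interface (q_sample, apply_model) and a 1x1024 embedding, while the actual objective, embedding shape, and sampling loop are in superres.ipynb.

import torch
import torch.nn.functional as F

def invert_condition(model, z_lowres, cond_dim=1024, steps=500, lr=1e-2):
    # z_lowres: (1, C, H, W) VAE latent of the low-resolution image we want to match.
    cond = torch.zeros(1, cond_dim, device=z_lowres.device, requires_grad=True)
    opt = torch.optim.Adam([cond], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, model.num_timesteps, (1,), device=z_lowres.device)
        noise = torch.randn_like(z_lowres)
        z_t = model.q_sample(z_lowres, t, noise=noise)   # noise the latent to step t
        eps = model.apply_model(z_t, t, cond)            # model's noise prediction given `cond`
        loss = F.mse_loss(eps, noise)                    # standard denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
    return cond.detach()  # reuse this embedding to condition higher-magnification sampling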
To train the model, you need to prepare a multi-scale dataset of {image, conditioning} pairs.
We use the DS-MIL codebase to extract regions from the WSIs, starting at the base 20x magnification. Patch sizes range from 256x256 to 32768x32768 pixels. For larger patches, you may want to use a lower tissue threshold.
The following command will extract 1024x1024 patches at 20x:
python deepzoom_tiler.py -m 0 -b 20 -s 1024
Refer to this issue for satellite image patch extraction.
We pre-extract UNI embeddings (the conditioning) from the full-resolution images in a patch-based manner: a 2048x2048 image is split into 64 patches of 256x256, yielding a 64x1024 UNI embedding.
We then resize the images to 256x256, extract VAE features, and save them together with the UNI embeddings.
For NAIP, we use the pre-trained DINOv2 ViT-Large (dinov2_vitl14_reg) checkpoint to extract embeddings.
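The patch-based embedding extraction can be pictured as follows. This is a rough sketch with placeholder names, not the repository's preprocessing script; uni_model stands for whichever backbone produces the 1024-d features (UNI for pathology, DINOv2 for NAIP).

import torch

# For NAIP, the backbone can be loaded from torch.hub, e.g.:
#   dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14_reg")

def extract_patch_embeddings(image, uni_model, patch=256):
    # image: (3, H, W) tensor with H and W divisible by `patch`,
    #        e.g. a 2048x2048 region gives 8x8 = 64 patches.
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)        # (3, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)  # (N, 3, p, p), N = 64 here
    with torch.no_grad():
        emb = uni_model(patches)   # (N, 1024) -- the conditioning saved alongside the VAE latents
    return emb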
Please take a look at the demo datasets: brca/naip or our dataloader scripts: brca/naip for more details.
Create a config file similar to this, which specifies the dataset, model, and training parameters.
Then, run the training script:
python main.py -t --gpus 0,1,2 --base configs/zoomldm_brca.yaml
@InProceedings{Yellapragada_2025_CVPR,
author = {Yellapragada, Srikar and Graikos, Alexandros and Triaridis, Kostas and Prasanna, Prateek and Gupta, Rajarsi and Saltz, Joel and Samaras, Dimitris},
title = {ZoomLDM: Latent Diffusion Model for Multi-scale Image Generation},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {23453-23463}
}