¹MMLab, CUHK  ²Shanghai AI Lab
*Correspondence
[project page] [arXiv] [Dataset] [Model]
Summary: We propose the Native-resolution diffusion Transformer (NiT), a model that explicitly learns varying resolutions and aspect ratios within its denoising process. This significantly improves training efficiency and generalization capability. To the best of our knowledge, NiT is the first model to attain SOTA results on both ImageNet-$256\times256$ and $512\times512$ generation benchmarks.
2025-06-03: We are delighted to introduce NiT, the first work to explicitly model native-resolution image synthesis. We have released the code, pretrained models, and processed dataset of NiT.
First, clone the repo:
git clone https://github.com/WZDTHU/NiT.git && cd NiT
conda create -n nit_env python=3.10
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn
pip install -r requirements.txt
pip install -e .
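Optionally, you can sanity-check the environment with a quick import test. A minimal sketch (the printed versions will depend on your local CUDA/PyTorch setup):

```python
# quick sanity check of the installed environment (optional)
import torch
import flash_attn  # compiled CUDA extension; import fails if the build is broken

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```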
With a single model, NiT-XL is competitive across multiple benchmarks and achieves dual SOTA on both ImageNet-$256\times256$ and ImageNet-$512\times512$ generation.
Model | Model Zoo | Model Size | FID-256x256 | FID-512x512 | FID-768x768 | FID-1024x1024 |
---|---|---|---|---|---|---|
NiT-XL-1000K | 🤗 HF | 675M | 2.16 | 1.57 | 4.05 | 4.52 |
NiT-XL-1500K | 🤗 HF | 675M | 2.03 | 1.45 | - | - |
mkdir checkpoints
wget -c "https://huggingface.co/GoodEnough/NiT-XL-Models/resolve/main/model_1000K.safetensors" -O checkpoints/nit_xl_model_1000K.safetensors
wget -c "https://huggingface.co/GoodEnough/NiT-XL-Models/resolve/main/model_1500K.safetensors" -O checkpoints/nit_xl_model_1500K.safetensors
The sampling hyper-parameters for NiT-XL-1000K are summarized as follows:
Resolution | Solver | NFE | CFG scale | CFG interval | FID | sFID | IS | Prec. | Rec. |
---|---|---|---|---|---|---|---|---|---|
256 × 256 | SDE | 250 | 2.25 | [0.0, 0.7] | 2.16 | 6.34 | 253.44 | 0.79 | 0.62 |
512 × 512 | SDE | 250 | 2.05 | [0.0, 0.7] | 1.57 | 4.13 | 260.69 | 0.81 | 0.63 |
768 × 768 | ODE | 50 | 3.0 | [0.0, 0.7] | 4.05 | 8.77 | 262.31 | 0.83 | 0.52 |
1024 × 1024 | ODE | 50 | 3.0 | [0.0, 0.8] | 4.52 | 7.99 | 286.87 | 0.82 | 0.50 |
1536 × 1536 | ODE | 50 | 3.5 | [0.0, 0.9] | 6.51 | 9.97 | 230.10 | 0.83 | 0.42 |
2048 × 2048 | ODE | 50 | 4.5 | [0.0, 0.9] | 24.76 | 18.02 | 131.36 | 0.67 | 0.46 |
320 × 960 | ODE | 50 | 4.0 | [0.0, 0.9] | 16.85 | 17.79 | 189.18 | 0.71 | 0.38 |
432 × 768 | ODE | 50 | 2.75 | [0.0, 0.7] | 4.11 | 10.30 | 254.71 | 0.83 | 0.55 |
480 × 640 | ODE | 50 | 2.75 | [0.0, 0.7] | 3.72 | 8.23 | 284.94 | 0.83 | 0.54 |
640 × 480 | ODE | 50 | 2.5 | [0.0, 0.7] | 3.41 | 8.07 | 259.06 | 0.83 | 0.56 |
768 × 432 | ODE | 50 | 2.85 | [0.0, 0.7] | 5.27 | 9.92 | 218.78 | 0.80 | 0.55 |
960 × 320 | ODE | 50 | 4.5 | [0.0, 0.9] | 9.90 | 25.78 | 255.95 | 0.74 | 0.40 |
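The CFG interval above restricts classifier-free guidance to a sub-range of the denoising trajectory. As a rough illustration of this idea only (not the repo's sampler; the model call, timestep convention, and conditioning format are placeholder assumptions), a guided prediction could look like:

```python
import torch

def guided_prediction(model, x_t, t, cond, uncond, cfg_scale, cfg_interval=(0.0, 0.7)):
    """Classifier-free guidance applied only when t falls inside cfg_interval.

    Illustrative sketch: `model`, the timestep convention (t in [0, 1]),
    and the conditioning format are assumptions, not the repo's API.
    """
    pred_cond = model(x_t, t, cond)
    lo, hi = cfg_interval
    if not (lo <= float(t) <= hi):
        return pred_cond  # outside the interval: no guidance
    pred_uncond = model(x_t, t, uncond)
    return pred_uncond + cfg_scale * (pred_cond - pred_uncond)
```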
Sampling with the NiT-XL-1000K model at 256×256 resolution:
bash scripts/sample/sample_256x256.sh
Sampling with the NiT-XL-1000K model at 512×512 resolution:
bash scripts/sample/sample_512x512.sh
Sampling with the NiT-XL-1000K model at 768×768 resolution:
bash scripts/sample/sample_768x768.sh
The sampling scripts generate a folder of samples for computing FID, Inception Score, and other metrics. Note that we do not pack the generated samples into a .npz file; this does not affect the calculation of FID and other metrics.
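If your own evaluation pipeline does expect an .npz sample batch (the ADM suite conventionally stores a uint8 NHWC array under the key arr_0), a minimal sketch for packing a folder of PNGs, assuming all samples share one resolution and the folder name is a placeholder:

```python
# pack a folder of generated PNGs into an ADM-style .npz batch (optional)
import glob
import numpy as np
from PIL import Image

files = sorted(glob.glob("samples_256x256/*.png"))  # placeholder sampling directory
batch = np.stack([np.array(Image.open(f).convert("RGB")) for f in files])  # (N, H, W, 3) uint8
np.savez("samples_256x256.npz", arr_0=batch)
print(batch.shape, batch.dtype)
```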
Please follow ADM's TensorFlow evaluation suite to set up the conda environment and download the reference batch.
wget -c "https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/classify_image_graph_def.pb" -O checkpoints/classify_image_graph_def.pb
Given the directory of the reference batch REFERENCE_DIR and the directory of the generated images SAMPLING_DIR, run the following command:
python projects/evaluate/adm_evaluator.py $REFERENCE_DIR $SAMPLING_DIR
Currently, we provide the full preprocessed dataset for ImageNet1K. Please use the following commands to download the meta files and preprocessed latents.
mkdir datasets
mkdir datasets/imagenet1k
bash tools/download_dataset_256x256.sh
bash tools/download_dataset_512x512.sh
bash tools/download_dataset_native_resolution.sh
You can also preprocess the ImageNet1K dataset on your own.
Set data_dir to your local ImageNet1K directory in configs/preprocess/imagenet1k_256x256.yaml, then run the preprocessing script:
bash scripts/preprocess/preorocess_in1k_256x256.sh
The preprocessing procedures for 512×512 and native-resolution data are similar:
bash scripts/preprocess/preorocess_in1k_512x512.sh
bash scripts/preprocess/preorocess_in1k_native_resolution.sh
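Conceptually, preprocessing encodes each image into DC-AE latents (the dc-ae-f32c32-sana-1.1-diffusers autoencoder downsamples by 32×) and stores them as safetensors. A rough, simplified sketch of that idea using the diffusers AutoencoderDC class; the model id, resizing policy, and normalization here are assumptions, and the scripts above remain the reference implementation:

```python
import numpy as np
import torch
from diffusers import AutoencoderDC
from PIL import Image
from safetensors.torch import save_file

ae = AutoencoderDC.from_pretrained(
    "mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers", torch_dtype=torch.float32
).eval()

img = Image.open("n01601694_11629.JPEG").convert("RGB")
# round the native size down to a multiple of 32 so the f32 autoencoder divides evenly
w, h = (img.width // 32) * 32, (img.height // 32) * 32
img = img.resize((w, h))
x = torch.from_numpy(np.array(img)).permute(2, 0, 1).float() / 127.5 - 1.0  # scale to [-1, 1]

with torch.no_grad():
    latent = ae.encode(x.unsqueeze(0)).latent  # (1, 32, h/32, w/32)
save_file({"latent": latent.squeeze(0)}, "n01601694_11629.safetensors")
```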
As we pack multiple image instances with distinct resolutions into one sequence, we need to pre-set the image indices of each pack before training.
First, download all the data-meta files, which store the height, width, and other information of each image.
bash tools/download_dataset_data_meta.sh
The above command will download four data-meta files into the datasets/imagenet1k/data_meta directory:
- dc-ae-f32c32-sana-1.1-diffusers_256x256_meta.jsonl: data-meta file for $256\times256$-resolution image data.
- dc-ae-f32c32-sana-1.1-diffusers_512x512_meta.jsonl: data-meta file for $512\times512$-resolution image data.
- dc-ae-f32c32-sana-1.1-diffusers_nr_meta.jsonl: data-meta file for native-resolution image data.
- dc-ae-f32c32-sana-1.1-diffusers_merge_meta.jsonl: a merged file of the above three files.
The first two entries of the native-resolution data-meta file (dc-ae-f32c32-sana-1.1-diffusers_nr_meta.jsonl) are as follows:
{"image_file": "n01601694/n01601694_11629.JPEG", "latent_file": "n01601694/n01601694_11629.safetensors", "ori_w": 580, "ori_h": 403, "latent_h": 12, "latent_w": 18, "image_h": 384, "image_w": 576, "type": "native-resolution"}
{"image_file": "n01601694/n01601694_11799.JPEG", "latent_file": "n01601694/n01601694_11799.safetensors", "ori_w": 500, "ori_h": 350, "latent_h": 10, "latent_w": 15, "image_h": 320, "image_w": 480, "type": "native-resolution"}
Given the maximum sequence length $L$, we pre-compute how images are grouped into packed sequences so that the total number of tokens in each pack does not exceed $L$. You can download our preprocessed packed sampler-meta files using the following command.
bash tools/download_dataset_sampler_meta.sh
The above command will download four sampler-meta files into the datasets/imagenet1k/sampler_meta directory:
- dc-ae-f32c32-sana-1.1-diffusers_merge_LPFHP_8192.json: corresponds to $L=8192$.
- dc-ae-f32c32-sana-1.1-diffusers_merge_LPFHP_16384.json: corresponds to $L=16384$. This is the setting used in the NiT-XL experiments.
- dc-ae-f32c32-sana-1.1-diffusers_merge_LPFHP_32768.json: corresponds to $L=32768$.
- dc-ae-f32c32-sana-1.1-diffusers_merge_LPFHP_65536.json: corresponds to $L=65536$.
NiT supports training with images of arbitrary resolutions and aspect ratios, so you can also prepare the packing (sampler-meta) files according to your own needs.
# generate the default sampler-meta
python tools/pack_dataset.py
# generate the sampler-meta for the fixed 256x256-resolution experiment with a maximum sequence length of 16384
python tools/pack_dataset.py --data-meta datasets/imagenet1k/data_meta/dc-ae-f32c32-sana-1.1-diffusers_256x256_meta.jsonl --max-seq-len 16384
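Conceptually, pack_dataset.py groups images into sequences whose total token count stays below --max-seq-len (the file names suggest an LPFHP-style packing), and the sampler-meta is essentially a list of such index groups. As a simplified stand-in for the idea only, a greedy first-fit-decreasing packer over the data-meta entries (not the repo's algorithm) could look like:

```python
import json

def pack_indices(token_counts, max_seq_len):
    """Greedy first-fit-decreasing packing: returns lists of image indices
    whose token counts sum to at most max_seq_len per pack."""
    order = sorted(range(len(token_counts)), key=lambda i: -token_counts[i])
    packs, loads = [], []
    for i in order:
        t = token_counts[i]
        for p, load in enumerate(loads):
            if load + t <= max_seq_len:
                packs[p].append(i)
                loads[p] += t
                break
        else:  # no existing pack has room: open a new one
            packs.append([i])
            loads.append(t)
    return packs

meta_path = "datasets/imagenet1k/data_meta/dc-ae-f32c32-sana-1.1-diffusers_merge_meta.jsonl"
with open(meta_path) as f:
    entries = [json.loads(line) for line in f]
packs = pack_indices([e["latent_h"] * e["latent_w"] for e in entries], max_seq_len=16384)
print(len(packs), "packs; first pack holds", len(packs[0]), "images")
```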
For the NiT-S (33M) model, we use RADIO-v2.5-B as the image encoder for the REPA loss. For the other NiT models, we use RADIO-v2.5-H as the image encoder.
wget -c "https://huggingface.co/nvidia/RADIO/resolve/main/radio_v2.5-h.pth.tar" -O checkpoints/radio_v2.5-h.pth.tar
wget -c "https://huggingface.co/nvidia/RADIO/resolve/main/radio-v2.5-b_half.pth.tar" -O checkpoints/radio-v2.5-b_half.pth.tar
The dataset-preparation steps above set up the packed_json, jsonl_dir, and latent_dirs fields in configs/c2i/nit_xl_pack_merge_radio_16384.yaml.
Before training, please specify image_dir as the directory of the ImageNet1K dataset on your own machine.
To train the XL-model (675M):
bash scripts/train/train_xl_model.sh
Specify the image_dir in configs/c2i/nit_s_pack_merge_radio_65536.yaml and train the S-model (33M):
bash scripts/train/train_s_model.sh
Specify the image_dir in configs/c2i/nit_b_pack_merge_radio_65536.yaml and train the base-model (131M):
bash scripts/train/train_b_model.sh
Specify the image_dir in configs/c2i/nit_l_pack_merge_radio_16384.yaml and train the L-model (457M):
bash scripts/train/train_l_model.sh
Specify the image_dir in configs/c2i/nit_xxl_pack_merge_radio_8192.yaml and train the XXL-model (1.37B):
bash scripts/train/train_xxl_model.sh
If you find the project useful, please kindly cite:
@article{wang2025native,
title={Native-Resolution Image Synthesis},
author={Wang, Zidong and Bai, Lei and Yue, Xiangyu and Ouyang, Wanli and Zhang, Yiyuan},
year={2025},
eprint={2506.03131},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This project is licensed under the Apache-2.0 license.