- A fork of https://github.com/Stability-AI/generative-models, with a focus on tokenizers
- Practical implementations and training code for popular tokenizers such as VQ, FSQ, LFQ, and BSQ
- Supports both the Stable Diffusion UNet and the BSQ-ViT backbones
- Pre-trained models and benchmarks on ImageNet 256x256
- Dependencies are listed in `environment.yaml`:
```bash
conda env create --file=environment.yaml
conda activate tokenizer
```
- Install from source:
```bash
pip install .
```
- It is recommended to list the dataset files in advance using:
```bash
python scripts/create_dataset_list.py --root $PATH_TO_DATASET_FOLDER --ext $IMAGE_EXTENSION --out $PATH_TO_OUTFILE
```
- This step is not mandatory; it just speeds up training (a sketch of the pre-listing step is below).
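Below is a minimal sketch of what the pre-listing step does, assuming the output is simply one image path per line; the actual format produced by `scripts/create_dataset_list.py` may differ, so treat this as illustrative only.

```python
# Illustrative sketch of pre-listing a dataset: walk the root folder and write
# one image path per line. The real scripts/create_dataset_list.py may use a
# different output format; check the script before relying on this.
import argparse
from pathlib import Path

def list_images(root: str, ext: str, out: str) -> None:
    paths = sorted(str(p) for p in Path(root).rglob(f"*.{ext}"))
    Path(out).write_text("\n".join(paths) + "\n")
    print(f"wrote {len(paths)} paths to {out}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--root", required=True)
    parser.add_argument("--ext", default="JPEG")
    parser.add_argument("--out", required=True)
    args = parser.parse_args()
    list_images(args.root, args.ext, args.out)
```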
- Modify the YAML file according to your system; pay special attention to `trainer-device`, `trainer-num_nodes`, and `data-train-params-root`.
- Gaussian VAE with Stable Diffusion UNet:
```bash
python main.py --config sd3unet_gaussian_kl_0.64.yaml --wandb
```
- FSQ with Stable Diffusion UNet:
```bash
python main.py --config sd3unet_fsq_16.yaml --wandb
```
- LFQ with Stable Diffusion UNet:
```bash
python main.py --config sd3unet_lfq_16.yaml --wandb
```
- Check `./configs/` for more configurations; a toy FSQ quantizer sketch follows this list.
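As a rough illustration of one of the quantizers trained above, FSQ bounds each latent channel to a small fixed set of levels and rounds with a straight-through gradient, so the effective codebook size is the product of the per-channel level counts. The toy sketch below is not this repository's implementation (the real level configuration lives in the YAML configs) and, for simplicity, assumes odd level counts; even counts need the extra half-step offset described in the FSQ paper.

```python
# Toy FSQ quantizer: straight-through rounding onto a per-channel grid.
# Illustrative only; not this repository's implementation.
import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """z: (..., C) latent with C == len(levels); returns values snapped to the grid."""
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2                        # e.g. 5 levels -> grid {-2, ..., 2}
    bounded = torch.tanh(z) * half                   # bound each channel to its grid range
    rounded = torch.round(bounded)                   # snap to the nearest grid point
    return bounded + (rounded - bounded).detach()    # straight-through estimator

z = torch.randn(2, 1024, 6, requires_grad=True)      # e.g. 1024 tokens, 6 latent channels
q = fsq_quantize(z, levels=[5] * 6)                  # 5^6 = 15625 possible codes
q.sum().backward()                                   # gradients flow through the STE
```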
- Usage (a hedged sketch of loading the config and checkpoint manually follows the command):
```bash
python -m torch.distributed.launch --standalone --use-env \
    --nproc-per-node=8 eval.py \
    --bs=32 \
    --base=$PATH_TO_YAML_CONFIG \
    --ckpt=$PATH_TO_CKPT \
    --dataset=$PATH_TO_DATASET_FOLDER
```
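Since the codebase is forked from Stability-AI/generative-models, `eval.py` presumably builds the model from the YAML config and checkpoint. The sketch below shows one way to do this manually, assuming the fork keeps the upstream `model: {target: ..., params: ...}` config convention; the exact keys, import paths, and checkpoint layout may differ.

```python
# Hedged sketch: build a tokenizer from its YAML config and load a checkpoint.
# Assumes the upstream generative-models convention of a `model.target` /
# `model.params` block; adjust keys and paths to match this repository.
import importlib
import torch
from omegaconf import OmegaConf

def instantiate_from_config(config):
    module, cls = config["target"].rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)(**config.get("params", dict()))

cfg = OmegaConf.load("configs/sd3unet_fsq_16.yaml")            # example config from this repo
model = instantiate_from_config(cfg.model)

state = torch.load("sd3unet_fsq_16.ckpt", map_location="cpu")  # checkpoint from the HF model repo
model.load_state_dict(state.get("state_dict", state), strict=False)
model.eval()
```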
- All models are trained on the ImageNet training set with 8x A100 GPUs for around 30 epochs, which takes around 24 hours
- All models are available at https://huggingface.co/xutongda/pytorch-image-tokenizer-models; benchmark results are in the table below, followed by a short PSNR sketch
| Spec | Config | Model | PSNR | SSIM | LPIPS | rFID |
|---|---|---|---|---|---|---|
| LFQ 2^16x1024 | sd3unet_lfq_16.yaml | sd3unet_lfq_16.ckpt | 22.65 | 0.635 | 0.141 | 3.523 |
| FSQ 2^16x1024 | sd3unet_fsq_16.yaml | sd3unet_fsq_16.ckpt | 26.87 | 0.785 | 0.072 | 1.161 |
| BSQ 2^16x1024 | sd3unet_bsq_16.yaml | sd3unet_bsq_16.ckpt | 25.62 | 0.754 | 0.086 | 1.080 |
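In this table, higher PSNR/SSIM and lower LPIPS/rFID indicate better reconstructions. As a reference for the first metric, PSNR over images scaled to [0, 1] is simply 10·log10(1/MSE); a minimal sketch (not the exact metric implementation used by `eval.py`):

```python
# Minimal PSNR sketch for image batches scaled to [0, 1]; SSIM, LPIPS, and rFID
# in the table come from standard external implementations.
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    mse = torch.mean((pred - target) ** 2, dim=(1, 2, 3))   # per-image MSE
    return 10.0 * torch.log10(1.0 / (mse + eps))            # peak signal value is 1.0

x = torch.rand(4, 3, 256, 256)                               # "ground-truth" batch
x_rec = (x + 0.01 * torch.randn_like(x)).clamp(0, 1)         # noisy "reconstruction"
print(psnr(x_rec, x).mean())                                 # higher is better
```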
- Main structure is forked from https://github.com/Stability-AI/generative-models
- BSQ, ViT backbone, and evaluation code from https://github.com/zhaoyue-zephyrus/bsq-vit
- LFQ from https://github.com/TencentARC/SEED-Voken
- VQ from https://github.com/ai-forever/MoVQGAN
- [VQ NIPS 17] Neural Discrete Representation Learning
- [LFQ ICLR 24] Language Model Beats Diffusion: Tokenizer is key to visual generation
- [FSQ ICLR 24] Finite Scalar Quantization: VQ-VAE Made Simple
- [BSQ ICLR 25] Image and Video Tokenization with Binary Spherical Quantization