Instella-T2I v0.1 is the first text-to-image model in the AMD Instella model family, trained exclusively using AMD Instinct MI300X GPUs. By representing images in a 1D binary latent space, our tokenizer encodes a 1024x1024 image using just 128 discrete tokens. Compared to the 4096 tokens typically required by standard VQ-VAEs, our tokenizer achieves a 32x token reduction. Instella-T2I v0.1 leverages our Instella-family language model, AMD OLMo-1B, for text encoding. The same architecture also serves as the backbone for both our diffusion and autoregressive models. Thanks to the large VRAM of the AMD Instinct MI300X GPUs and the compact 1D binary latent space adopted in Instella-T2I v0.1, we can fit 4096 images on a single compute node with 8 AMD Instinct MI300X GPUs, achieving a training throughput of over 220 images per second on each GPU. Both the diffusion and autoregressive text-to-image models can be trained within 200 MI300X GPU days. Training Instella-T2I from scratch on AMD Instinct MI300X GPUs demonstrates the platform's capability and scalability for a broad range of AI workloads, including computationally intensive text-to-image diffusion models.
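The 32x figure follows directly from the token counts above. A minimal sketch of the arithmetic, assuming a standard VQ-VAE with 16x spatial downsampling (a common setting that yields a 64x64 token grid for a 1024x1024 image):

```python
# Token-count comparison for a 1024x1024 image.
# Assumption: a standard VQ-VAE with 16x spatial downsampling,
# giving a (1024/16) x (1024/16) 2D token grid.
vqvae_tokens = (1024 // 16) ** 2   # 64 * 64 = 4096 tokens
instella_tokens = 128              # fixed-length 1D binary latent

reduction = vqvae_tokens // instella_tokens
print(vqvae_tokens, instella_tokens, reduction)  # 4096 128 32
```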
First install PyTorch according to the instructions specific to your operating system. For AMD GPUs, you can also start with the rocm/pytorch Docker image.
To install the recommended packages, run:
git clone https://github.com/AMD-AIG-AIMA/Instella-T2I.git
cd Instella-T2I
# install Flash-Attention on MI300X
GPU_ARCH=gfx942 MAX_JOBS=$(nproc) pip install git+https://github.com/Dao-AILab/flash-attention.git -v
# install other dependencies
pip install -r requirements.txt
Use the provided test_diff.py
and test_ar.py
scripts to run image generation in interactive mode for the diffusion and AR models, respectively.
The inference scripts will automatically download the checkpoints to the path specified by --ckpt_path
.
python test_diff.py --ckpt_path DESIRED_PATH_TO_MODELS
python test_ar.py --ckpt_path DESIRED_PATH_TO_MODELS
To specify hyperparameters, run:
python test_diff.py \
--ckpt_path DESIRED_PATH_TO_MODELS \
--cfg_scale 9.0 \
--temp 0.8 \
--num_steps 50
The training of the image generation models adopts a two-stage recipe. In stage one, the model is pretrained using the LAION-COCO dataset. In stage two, the data is augmented with synthetic image–text pairs, with a ratio of 3:1 between the LAION and the synthetic data. The synthetic data consists of data from Dalle-1M and images generated from public models.
The training also includes a small amount of synthetic data, generated using prompts from DiffusionDB. We use the following open models to generate the synthetic data:
All data are generated using the models' default inference settings.
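The 3:1 LAION-to-synthetic mix described above can be sketched as a simple probabilistic sampler. This is an illustrative assumption about how such a ratio might be enforced per-sample, not the actual training data loader:

```python
import random

def sample_source(rng: random.Random) -> str:
    # Draw "laion" with probability 3/4 and "synthetic" with 1/4,
    # which realizes a 3:1 mixing ratio in expectation.
    return "laion" if rng.random() < 0.75 else "synthetic"

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(100_000)]
ratio = draws.count("laion") / draws.count("synthetic")
print(round(ratio, 2))  # approximately 3.0
```

In practice the same effect is often achieved by weighted dataset concatenation or a weighted sampler rather than per-example coin flips.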
If you find this project helpful for your research, please consider citing us:
@article{instella-t2i,
title={Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation},
author={Wang, Ze and Chen, Hao and Hu, Benran and Liu, Jiang and Sun, Ximeng and Wu, Jialian and Su, Yusheng and Yu, Xiaodong and Barsoum, Emad and Liu, Zicheng},
journal={arXiv preprint arXiv:2506.21022},
year={2025}
}