

TEMU-VTOFF

Text-Enhanced MUlti-category Virtual Try-Off


Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
Davide Lobba (1,2,*), Fulvio Sanguigni (2,3,*), Bin Ren (1,2), Marcella Cornia (3), Rita Cucchiara (3), Nicu Sebe (1)
(1) University of Trento, (2) University of Pisa, (3) University of Modena and Reggio Emilia
(*) Equal contribution

Table of Contents
  1. About The Project
  2. Key Features
  3. Getting Started
  4. Inference
  5. Dataset Inference
  6. Contact
  7. Citation

💡 About The Project

This repository provides the official implementation of the paper. TEMU-VTOFF is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating a clean, in-shop image of a garment from a photo of a person wearing it. By combining a pretrained feature extractor with a text-enhanced generation module, our method handles occlusions, multiple garment categories, and ambiguous appearances. It further refines generation fidelity via a feature alignment module based on DINOv2.
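
To make the alignment idea concrete, here is a minimal, self-contained sketch of a feature-alignment objective, where features of the generated garment are pulled towards features of the clean target garment. It is only an illustration: feature_extractor below is a toy stand-in for the frozen DINOv2 backbone, and all shapes and values are arbitrary.

import torch
import torch.nn.functional as F

# Toy stand-in for a frozen image backbone; the paper uses DINOv2 features instead.
feature_extractor = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=8, stride=8),
    torch.nn.Flatten(),
    torch.nn.LazyLinear(256),
).eval()

def alignment_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Push features of the generated garment towards features of the clean in-shop garment.
    with torch.no_grad():
        target_feat = feature_extractor(target)
    gen_feat = feature_extractor(generated)
    return 1.0 - F.cosine_similarity(gen_feat, target_feat, dim=-1).mean()

# Two random 224x224 tensors standing in for generated / ground-truth garment images.
print(alignment_loss(torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224)).item())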

✨ Key Features

Our contributions can be summarized as follows:

  • 🎯 Multi-Category Try-Off. We present a unified framework capable of handling multiple garment types (upper-body, lower-body, and full-body clothes) without requiring category-specific pipelines.
  • 🔗 Multimodal Hybrid Attention. We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features. This helps the model synthesize occluded or ambiguous garment regions more accurately; a rough illustrative sketch follows this list.
  • ⚡ Garment Aligner Module. We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This improves alignment consistency across the dataset and preserves fine-grained visual details.
  • 📊 Extensive experiments. Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both the quality of generated images and alignment with the target garment, highlighting its strong generalization capabilities.
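
As referenced in the Multimodal Hybrid Attention bullet, the snippet below gives a rough, generic picture of attending over a joint context of caption tokens and person-image tokens. It is not the actual TEMU-VTOFF module: the dimensions, tensor names, and the single nn.MultiheadAttention layer are illustrative choices only.

import torch
import torch.nn as nn

class HybridAttentionSketch(nn.Module):
    """Toy joint attention over garment-caption tokens and person-image tokens."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, garment_tokens, text_tokens, person_tokens):
        # The garment stream queries a context built from caption and person tokens,
        # so textual cues can fill in occluded or ambiguous garment regions.
        context = torch.cat([text_tokens, person_tokens], dim=1)
        out, _ = self.attn(query=garment_tokens, key=context, value=context)
        return self.norm(garment_tokens + out)

# Hypothetical shapes: 1 sample, 77 caption tokens, 256 person tokens, 256 garment tokens.
block = HybridAttentionSketch()
x = block(torch.randn(1, 256, 64), torch.randn(1, 77, 64), torch.randn(1, 256, 64))
print(x.shape)  # torch.Size([1, 256, 64])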

💻 Getting Started

Prerequisites

Clone the repository:

git clone https://github.com/davidelobba/TEMU-VTOFF.git
cd TEMU-VTOFF

Installation

  1. We recommend installing the required packages using Python's native virtual environment (venv) as follows:
    python -m venv venv
    source venv/bin/activate
  2. Upgrade pip and install dependencies
    pip install --upgrade pip
    pip install -r requirements.txt
  3. Create a .env file like the following:
    export WANDB_API_KEY="ENTER YOUR WANDB TOKEN"
    export HF_TOKEN="ENTER YOUR HUGGINGFACE TOKEN"
    export HF_HOME="PATH WHERE YOU WANT TO SAVE THE HF MODELS"

🧠 Note: Access to Stable Diffusion 3 Medium must be requested via HuggingFace.
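
If you prefer to authenticate programmatically rather than exporting HF_TOKEN, the huggingface_hub client accepts the same token. This step is optional and assumes your account has already been granted access to the gated SD3 weights.

import os
from huggingface_hub import login

# Reuses the token from the .env file; access to
# stabilityai/stable-diffusion-3-medium-diffusers must already be granted on the Hub.
login(token=os.environ["HF_TOKEN"])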

Inference

Let's generate the in-shop garment image.

source venv/bin/activate
source .env

python inference.py \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
    --pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
    --seed 42 \
    --width "768" \
    --height "1024" \
    --output_dir "put here the output path" \
    --mixed_precision "bf16" \
    --example_image "examples/example1.jpg" \
    --guidance_scale 2.0 \
    --num_inference_steps 28
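
To run the single-image script over several photos (for example, everything under examples/), one possible wrapper simply reuses the flags above. The results/ output path is a placeholder, and the venv and .env are assumed to be already sourced.

import subprocess
from pathlib import Path

examples_dir = Path("examples")   # folder with input photos of clothed people
output_root = Path("results")     # placeholder output path

for image_path in sorted(examples_dir.glob("*.jpg")):
    # Same flags as the single-image command above, one run per input image.
    subprocess.run([
        "python", "inference.py",
        "--pretrained_model_name_or_path", "stabilityai/stable-diffusion-3-medium-diffusers",
        "--pretrained_model_name_or_path_sd3_tryoff", "davidelobba/TEMU-VTOFF",
        "--seed", "42",
        "--width", "768",
        "--height", "1024",
        "--output_dir", str(output_root / image_path.stem),
        "--mixed_precision", "bf16",
        "--example_image", str(image_path),
        "--guidance_scale", "2.0",
        "--num_inference_steps", "28",
    ], check=True)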

Dataset Inference

Dataset Captioning

Generate textual descriptions for each sample using a multimodal VLM (e.g., Qwen2.5-VL).

python precompute_utils/captioning_qwen.py \
            --pretrained_model_name_or_path "Qwen/Qwen2.5-VL-7B-Instruct" \
            --dataset_name "dresscode" \
            --dataset_root "put here your dataset path" \
            --filename "qwen_captions_2_5.json" \
            --temperatures 0.2
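
The captions are stored in the JSON file specified by --filename (the final name may include the temperature, as in the qwen_captions_2_5_0_2.json file used later). The exact schema is not documented here, so the quick check below only assumes the file parses as JSON.

import json

# Placeholder path: the captioning script writes this file inside the dataset root.
with open("qwen_captions_2_5_0_2.json") as f:
    captions = json.load(f)

print(type(captions).__name__, len(captions))
# Peek at a few entries without assuming a specific schema.
sample = list(captions.items())[:3] if isinstance(captions, dict) else captions[:3]
for entry in sample:
    print(entry)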

Feature Extraction

Extract textual features using OpenCLIP, CLIP and T5 text encoders.

phases=("test" "train")
for phase in "${phases[@]}"; do
   python precompute_utils/precompute_text_features.py \
               --pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
               --dataset_name "dresscode" \
               --dataset_root "put here your dataset path" \
               --phase $phase \
               --order "paired" \
               --category "all" \
               --output_dir "" \
               --seed 42 \
               --height 1024 \
               --width 768 \
               --batch_size 4 \
               --mixed_precision "fp16" \
               --num_workers 8 \
               --text_encoders "T5" "CLIP" \
               --captions_type "qwen_text_embeddings"
done

Extract visual features using OpenCLIP and CLIP vision encoders.

phases=("test" "train")
for phase in "${phases[@]}"; do
   python precompute_utils/precompute_image_features.py \
               --dataset "dresscode" \
               --dataroot "put here your dataset path" \
               --phase $phase \
               --order "paired" \
               --category "all" \
               --seed 42 \
               --height 1024 \
               --width 768 \
               --batch_size 4 \
               --mixed_precision "fp16" \
               --num_workers 8
done

Generate Images

Let's generate the in-shop garment images of DressCode or VITON-HD using the TEMU-VTOFF model.

source venv/bin/activate
source .env

python inference_dataset.py \
    --pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
    --pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
    --dataset_name "dresscode" \
    --dataset_root "put here your dataset path" \
    --output_dir "put here the output path" \
    --coarse_caption_file "qwen_captions_2_5_0_2.json" \
    --phase "test" \
    --order "paired" \
    --height "1024" \
    --width "768" \
    --mask_type bounding_box \
    --category "all" \
    --batch_size 4 \
    --mixed_precision "bf16" \
    --seed 42 \
    --num_workers 8 \
    --fine_mask \
    --guidance_scale 2.0 \
    --num_inference_steps 28
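
For a quick qualitative check of a run, a small contact sheet of the generated garments can be handy. The snippet below is a convenience sketch, not part of the repository: it assumes the images live somewhere under the folder passed to --output_dir and that Pillow is installed.

from pathlib import Path
from PIL import Image

# Placeholder: point this at the same folder passed to --output_dir above.
output_dir = Path("put here the output path")
thumb = (192, 256)

# Collect whatever images inference_dataset.py produced, wherever it nested them.
paths = sorted(p for p in output_dir.rglob("*") if p.suffix.lower() in {".jpg", ".png"})[:8]
if not paths:
    raise SystemExit(f"No generated images found under {output_dir}")

sheet = Image.new("RGB", (thumb[0] * len(paths), thumb[1]), "white")
for i, p in enumerate(paths):
    sheet.paste(Image.open(p).convert("RGB").resize(thumb), (i * thumb[0], 0))
sheet.save("preview.jpg")
print(f"Wrote preview.jpg with {len(paths)} generated garments")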

📬 Contact

Lead Authors: Davide Lobba, Fulvio Sanguigni

For questions about the project, feel free to reach out to any of the lead authors!

Citation

Please cite our paper if you find our work helpful:

@article{lobba2025inverse,
  title={Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals},
  author={Lobba, Davide and Sanguigni, Fulvio and Ren, Bin and Cornia, Marcella and Cucchiara, Rita and Sebe, Nicu},
  journal={arXiv preprint arXiv:2505.21062},
  year={2025}
}
