
Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
Davide Lobba1,2,*, Fulvio Sanguigni2,3,*, Bin Ren1,2, Marcella Cornia3, Rita Cucchiara3, Nicu Sebe1
1University of Trento, 2University of Pisa, 3University of Modena and Reggio Emilia
* Equal contribution
TEMU-VTOFF is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating clean, in-shop images of garments worn by a person. By combining a pretrained feature extractor with a text-enhanced generation module, our method can handle occlusions, multiple garment categories, and ambiguous appearances. It further refines generation fidelity via a feature alignment module based on DINOv2.
Our contribution can be summarized as follows:
- 🎯 Multi-Category Try-Off. We present a unified framework capable of handling multiple garment types (upper-body, lower-body, and full-body clothes) without requiring category-specific pipelines.
- 🔗 Multimodal Hybrid Attention. We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features. This helps the model synthesize occluded or ambiguous garment regions more accurately (see the illustrative sketch after this list).
- ⚡ Garment Aligner Module. We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This yields more consistent alignment across the dataset and better preservation of fine visual details.
- 📊 Extensive experiments. Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both the quality of generated images and alignment with the target garment, highlighting its strong generalization capabilities.
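As a rough intuition for the hybrid attention, here is a minimal PyTorch sketch under assumed SD3-Medium-like tensor shapes. This is not the actual TEMU-VTOFF implementation; all class and variable names are hypothetical. The garment stream cross-attends to a context built from both the caption embeddings and the person-image features:

import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Toy stand-in for a multimodal hybrid attention block."""
    def __init__(self, dim: int = 1536, num_heads: int = 24):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_text = nn.Linear(dim, dim)    # projects garment caption tokens
        self.proj_person = nn.Linear(dim, dim)  # projects person-image features

    def forward(self, garment_tokens, text_tokens, person_tokens):
        # Keys/values mix the textual description with person-specific
        # features, so occluded garment regions can be inferred from the caption.
        context = torch.cat(
            [self.proj_text(text_tokens), self.proj_person(person_tokens)], dim=1)
        out, _ = self.attn(garment_tokens, context, context)
        return garment_tokens + out  # residual connection

x = torch.randn(2, 4096, 1536)       # latent garment tokens (hypothetical shape)
txt = torch.randn(2, 77, 1536)       # caption embeddings
person = torch.randn(2, 4096, 1536)  # person-image features
print(HybridAttention()(x, txt, person).shape)  # torch.Size([2, 4096, 1536])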
Clone the repository:
git clone https://github.com/davidelobba/TEMU-VTOFF.git
- We recommend installing the required packages using Python's native virtual environment (venv) as follows:
python -m venv venv
source venv/bin/activate
- Upgrade pip and install dependencies:
pip install --upgrade pip
pip install -r requirements.txt
- Create a .env file like the following:
export WANDB_API_KEY="ENTER YOUR WANDB TOKEN"
export HF_TOKEN="ENTER YOUR HUGGINGFACE TOKEN"
export HF_HOME="PATH WHERE YOU WANT TO SAVE THE HF MODELS"
🧠 Note: Access to Stable Diffusion 3 Medium must be requested via HuggingFace.
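As a quick sanity check (a minimal sketch, assuming the .env above has been sourced), you can verify that your token actually grants access to the gated repository before running inference:

import os
from huggingface_hub import login, model_info

# Authenticate with the token exported in .env.
login(token=os.environ["HF_TOKEN"])
# Raises GatedRepoError if access to the gated SD3 repo was not granted.
info = model_info("stabilityai/stable-diffusion-3-medium-diffusers")
print(f"Access OK: {info.id}")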
Let's generate the in-shop garment image.
source venv/bin/activate
source .env
python inference.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
--seed 42 \
--width "768" \
--height "1024" \
--output_dir "put here the output path" \
--mixed_precision "bf16" \
--example_image "examples/example1.jpg" \
--guidance_scale 2.0 \
--num_inference_steps 28
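To run the same command over a whole folder of inputs, a small driver script can help. This is a hedged convenience wrapper, not part of the repository; the examples/ glob and the outputs directory are placeholders, and only flags shown above are used:

import subprocess
from pathlib import Path

for image in sorted(Path("examples").glob("*.jpg")):
    # Invoke inference.py once per example image with the documented flags.
    subprocess.run([
        "python", "inference.py",
        "--pretrained_model_name_or_path", "stabilityai/stable-diffusion-3-medium-diffusers",
        "--pretrained_model_name_or_path_sd3_tryoff", "davidelobba/TEMU-VTOFF",
        "--seed", "42",
        "--width", "768", "--height", "1024",
        "--output_dir", "outputs",  # placeholder output path
        "--mixed_precision", "bf16",
        "--example_image", str(image),
        "--guidance_scale", "2.0",
        "--num_inference_steps", "28",
    ], check=True)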
Generate textual descriptions for each sample using a vision-language model (e.g., Qwen2.5-VL).
python precompute_utils/captioning_qwen.py \
--pretrained_model_name_or_path "Qwen/Qwen2.5-VL-7B-Instruct" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--filename "qwen_captions_2_5.json" \
--temperatures 0.2
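For reference, the captioning step conceptually boils down to the transformers-based sketch below. The repository script may batch and prompt differently; the prompt text and example path here are assumptions:

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Chat message with an image placeholder; the prompt wording is hypothetical.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the garment worn by the person."}]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("examples/example1.jpg")],
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, temperature=0.2, do_sample=True)
# Strip the prompt tokens and decode only the generated caption.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])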
Extract textual features using OpenCLIP, CLIP and T5 text encoders.
phases=("test" "train")
for phase in "${phases[@]}"; do
python precompute_utils/precompute_text_features.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--phase $phase \
--order "paired" \
--category "all" \
--output_dir "" \
--seed 42 \
--height 1024 \
--width 768 \
--batch_size 4 \
--mixed_precision "fp16" \
--num_workers 8 \
--text_encoders "T5" "CLIP" \
--captions_type "qwen_text_embeddings"
done
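Under the hood, these embeddings come from the text encoders bundled with the SD3 Medium diffusers checkpoint. Here is a hedged single-caption sketch (the caption string and sequence lengths are illustrative; the repo script handles the exact preprocessing):

import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection, T5EncoderModel

base = "stabilityai/stable-diffusion-3-medium-diffusers"  # gated repo, needs HF_TOKEN
tok_clip = AutoTokenizer.from_pretrained(base, subfolder="tokenizer")
enc_clip = CLIPTextModelWithProjection.from_pretrained(base, subfolder="text_encoder")
tok_t5 = AutoTokenizer.from_pretrained(base, subfolder="tokenizer_3")
enc_t5 = T5EncoderModel.from_pretrained(base, subfolder="text_encoder_3")

caption = "a red cotton t-shirt with a round neckline"  # made-up example
with torch.no_grad():
    clip_emb = enc_clip(**tok_clip(caption, padding="max_length", max_length=77,
                                   truncation=True, return_tensors="pt")).last_hidden_state
    t5_emb = enc_t5(**tok_t5(caption, padding="max_length", max_length=256,
                             truncation=True, return_tensors="pt")).last_hidden_state
print(clip_emb.shape, t5_emb.shape)  # per-token text features to cache on disk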
Extract visual features using OpenCLIP and CLIP vision encoders.
phases=("test" "train")
for phase in "${phases[@]}"; do
python precompute_utils/precompute_image_features.py \
--dataset "dresscode" \
--dataroot "put here your dataset path" \
--phase $phase \
--order "paired" \
--category "all" \
--seed 42 \
--height 1024 \
--width 768 \
--batch_size 4 \
--mixed_precision "fp16" \
--num_workers 8
done
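Analogously, the visual side caches CLIP-style image embeddings. A hedged single-image sketch follows; the exact vision backbones are defined in the script, and the checkpoint below is an assumption:

import torch
from PIL import Image
from transformers import AutoImageProcessor, CLIPVisionModelWithProjection

model_id = "openai/clip-vit-large-patch14"  # assumed backbone; check the script
processor = AutoImageProcessor.from_pretrained(model_id)
encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

image = Image.open("examples/example1.jpg").convert("RGB")
with torch.no_grad():
    out = encoder(**processor(images=image, return_tensors="pt"))
print(out.image_embeds.shape)       # pooled projection, e.g. [1, 768]
print(out.last_hidden_state.shape)  # patch tokens, e.g. [1, 257, 1024]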
Let's generate the in-shop garment images for the Dress Code or VITON-HD datasets using the TEMU-VTOFF model.
source venv/bin/activate
source .env
python inference_dataset.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--output_dir "put here the output path" \
--coarse_caption_file "qwen_captions_2_5_0_2.json" \
--phase "test" \
--order "paired" \
--height "1024" \
--width "768" \
--mask_type bounding_box \
--category "all" \
--batch_size 4 \
--mixed_precision "bf16" \
--seed 42 \
--num_workers 8 \
--fine_mask \
--guidance_scale 2.0 \
--num_inference_steps 28
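Once the paired in-shop images are generated, image quality and garment alignment can be quantified. As a hedged example (the folder layout is a placeholder and this uses torchmetrics rather than the authors' evaluation code), FID and SSIM against the ground-truth garments can be computed like this:

import torch
from pathlib import Path
from PIL import Image
from torchvision.transforms.functional import to_tensor, resize
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image import StructuralSimilarityIndexMeasure

def load(folder):
    # Load every JPEG as a float tensor in [0, 1], resized to 1024x768.
    imgs = [resize(to_tensor(Image.open(p).convert("RGB")), [1024, 768])
            for p in sorted(Path(folder).glob("*.jpg"))]
    return torch.stack(imgs)

fake, real = load("outputs"), load("ground_truth")  # placeholder paths
fid = FrechetInceptionDistance(normalize=True)      # normalize=True expects floats in [0, 1]
fid.update(real, real=True)
fid.update(fake, real=False)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
print("FID:", fid.compute().item(), "SSIM:", ssim(fake, real).item())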
Lead Authors:
- 📧 Davide Lobba: davide.lobba@unitn.it | 🎓 Google Scholar
- 📧 Fulvio Sanguigni: fulvio.sanguigni@unimore.it | 🎓 Google Scholar
For questions about the project, feel free to reach out to any of the lead authors!
Please cite our paper if you find our work helpful:
@article{lobba2025inverse,
  title={Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals},
  author={Lobba, Davide and Sanguigni, Fulvio and Ren, Bin and Cornia, Marcella and Cucchiara, Rita and Sebe, Nicu},
  journal={arXiv preprint arXiv:2505.21062},
  year={2025}
}