
Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals
Davide Lobba1,2,*, Fulvio Sanguigni2,3,*, Bin Ren1,2, Marcella Cornia3, Rita Cucchiara3, Nicu Sebe1
1University of Trento, 2University of Pisa, 3University of Modena and Reggio Emilia
* Equal contribution
TEMU-VTOFF is a novel dual-DiT (Diffusion Transformer) architecture designed for the Virtual Try-Off task: generating clean, in-shop images of garments worn by a person. By combining a pretrained feature extractor with a text-enhanced generation module, our method can handle occlusions, multiple garment categories, and ambiguous appearances. It further refines generation fidelity via a feature alignment module based on DINOv2.
Our contribution can be summarized as follows:
- 🎯 Multi-Category Try-Off. We present a unified framework capable of handling multiple garment types (upper-body, lower-body, and full-body clothes) without requiring category-specific pipelines.
- 🔗 Multimodal Hybrid Attention. We introduce a novel attention mechanism that integrates garment textual descriptions into the generative process by linking them with person-specific features. This helps the model synthesize occluded or ambiguous garment regions more accurately (see the illustrative sketch after this list).
- ⚡ Garment Aligner Module. We design a lightweight aligner that conditions generation on clean garment images, replacing conventional denoising objectives. This yields more consistent alignment across the dataset and better preservation of fine visual details.
- 📊 Extensive experiments. Experiments on the Dress Code and VITON-HD datasets demonstrate that TEMU-VTOFF outperforms prior methods in both the quality of generated images and alignment with the target garment, highlighting its strong generalization capabilities.
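As a rough intuition for the hybrid attention, here is a minimal PyTorch sketch under assumed SD3-Medium-like tensor shapes. This is not the actual TEMU-VTOFF implementation; all class and variable names are hypothetical. The garment stream cross-attends to a context built from both the caption embeddings and the person-image features:

import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Toy stand-in for a multimodal hybrid attention block."""
    def __init__(self, dim: int = 1536, num_heads: int = 24):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_text = nn.Linear(dim, dim)    # projects garment caption tokens
        self.proj_person = nn.Linear(dim, dim)  # projects person-image features

    def forward(self, garment_tokens, text_tokens, person_tokens):
        # Keys/values mix the textual description with person-specific
        # features, so occluded garment regions can be inferred from the caption.
        context = torch.cat(
            [self.proj_text(text_tokens), self.proj_person(person_tokens)], dim=1)
        out, _ = self.attn(garment_tokens, context, context)
        return garment_tokens + out  # residual connection

x = torch.randn(2, 4096, 1536)       # latent garment tokens (hypothetical shape)
txt = torch.randn(2, 77, 1536)       # caption embeddings
person = torch.randn(2, 4096, 1536)  # person-image features
print(HybridAttention()(x, txt, person).shape)  # torch.Size([2, 4096, 1536])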
Clone the repository:
git clone https://github.com/davidelobba/TEMU-VTOFF.git
- We recommend installing the required packages using Python's native virtual environment (venv) as follows:
python -m venv venv
source venv/bin/activate
- Upgrade pip and install dependencies:
pip install --upgrade pip
pip install -r requirements.txt
- Create a .env file like the following:
export WANDB_API_KEY="ENTER YOUR WANDB TOKEN"
export HF_TOKEN="ENTER YOUR HUGGINGFACE TOKEN"
export HF_HOME="PATH WHERE YOU WANT TO SAVE THE HF MODELS"
🧠 Note: Access to Stable Diffusion 3 Medium must be requested via HuggingFace.
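As a quick sanity check (a minimal sketch, assuming the .env above has been sourced), you can verify that your token actually grants access to the gated repository before running inference:

import os
from huggingface_hub import login, model_info

# Authenticate with the token exported in .env.
login(token=os.environ["HF_TOKEN"])
# Raises GatedRepoError if access to the gated SD3 repo was not granted.
info = model_info("stabilityai/stable-diffusion-3-medium-diffusers")
print(f"Access OK: {info.id}")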
Let's generate the in-shop garment image.
source venv/bin/activate
source .env
python inference.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
--seed 42 \
--width "768" \
--height "1024" \
--output_dir "put here the output path" \
--mixed_precision "bf16" \
--example_image "examples/example1.jpg" \
--guidance_scale 2.0 \
--num_inference_steps 28
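To run the same command over a whole folder of inputs, a small driver script can help. This is a hedged convenience wrapper, not part of the repository; the examples/ glob and the outputs directory are placeholders, and only flags shown above are used:

import subprocess
from pathlib import Path

for image in sorted(Path("examples").glob("*.jpg")):
    # Invoke inference.py once per example image with the documented flags.
    subprocess.run([
        "python", "inference.py",
        "--pretrained_model_name_or_path", "stabilityai/stable-diffusion-3-medium-diffusers",
        "--pretrained_model_name_or_path_sd3_tryoff", "davidelobba/TEMU-VTOFF",
        "--seed", "42",
        "--width", "768", "--height", "1024",
        "--output_dir", "outputs",  # placeholder output path
        "--mixed_precision", "bf16",
        "--example_image", str(image),
        "--guidance_scale", "2.0",
        "--num_inference_steps", "28",
    ], check=True)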
Generate textual descriptions for each sample using a vision-language model (e.g., Qwen2.5-VL).
python precompute_utils/captioning_qwen.py \
--pretrained_model_name_or_path "Qwen/Qwen2.5-VL-7B-Instruct" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--filename "qwen_captions_2_5.json" \
--temperatures 0.2
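For reference, the captioning step conceptually boils down to the transformers-based sketch below. The repository script may batch and prompt differently; the prompt text and example path here are assumptions:

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Chat message with an image placeholder; the prompt wording is hypothetical.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the garment worn by the person."}]}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("examples/example1.jpg")],
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, temperature=0.2, do_sample=True)
# Strip the prompt tokens and decode only the generated caption.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])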
Extract textual features using OpenCLIP, CLIP and T5 text encoders.
phases=("test" "train")
for phase in "${phases[@]}"; do
python precompute_utils/precompute_text_features.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--phase $phase \
--order "paired" \
--category "all" \
--output_dir "" \
--seed 42 \
--height 1024 \
--width 768 \
--batch_size 4 \
--mixed_precision "fp16" \
--num_workers 8 \
--text_encoders "T5" "CLIP" \
--captions_type "qwen_text_embeddings"
done
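Under the hood, these embeddings come from the text encoders bundled with the SD3 Medium diffusers checkpoint. Here is a hedged single-caption sketch (the caption string and sequence lengths are illustrative; the repo script handles the exact preprocessing):

import torch
from transformers import AutoTokenizer, CLIPTextModelWithProjection, T5EncoderModel

base = "stabilityai/stable-diffusion-3-medium-diffusers"  # gated repo, needs HF_TOKEN
tok_clip = AutoTokenizer.from_pretrained(base, subfolder="tokenizer")
enc_clip = CLIPTextModelWithProjection.from_pretrained(base, subfolder="text_encoder")
tok_t5 = AutoTokenizer.from_pretrained(base, subfolder="tokenizer_3")
enc_t5 = T5EncoderModel.from_pretrained(base, subfolder="text_encoder_3")

caption = "a red cotton t-shirt with a round neckline"  # made-up example
with torch.no_grad():
    clip_emb = enc_clip(**tok_clip(caption, padding="max_length", max_length=77,
                                   truncation=True, return_tensors="pt")).last_hidden_state
    t5_emb = enc_t5(**tok_t5(caption, padding="max_length", max_length=256,
                             truncation=True, return_tensors="pt")).last_hidden_state
print(clip_emb.shape, t5_emb.shape)  # per-token text features to cache on disk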
Extract visual features using OpenCLIP and CLIP vision encoders.
phases=("test" "train")
for phase in "${phases[@]}"; do
python precompute_utils/precompute_image_features.py \
--dataset "dresscode" \
--dataroot "put here your dataset path" \
--phase $phase \
--order "paired" \
--category "all" \
--seed 42 \
--height 1024 \
--width 768 \
--batch_size 4 \
--mixed_precision "fp16" \
--num_workers 8
done
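Analogously, the visual side caches CLIP-style image embeddings. A hedged single-image sketch follows; the exact vision backbones are defined in the script, and the checkpoint below is an assumption:

import torch
from PIL import Image
from transformers import AutoImageProcessor, CLIPVisionModelWithProjection

model_id = "openai/clip-vit-large-patch14"  # assumed backbone; check the script
processor = AutoImageProcessor.from_pretrained(model_id)
encoder = CLIPVisionModelWithProjection.from_pretrained(model_id)

image = Image.open("examples/example1.jpg").convert("RGB")
with torch.no_grad():
    out = encoder(**processor(images=image, return_tensors="pt"))
print(out.image_embeds.shape)       # pooled projection, e.g. [1, 768]
print(out.last_hidden_state.shape)  # patch tokens, e.g. [1, 257, 1024]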
Let's generate the in-shop garment images for the Dress Code or VITON-HD datasets using the TEMU-VTOFF model.
source venv/bin/activate
source .env
python inference_dataset.py \
--pretrained_model_name_or_path "stabilityai/stable-diffusion-3-medium-diffusers" \
--pretrained_model_name_or_path_sd3_tryoff "davidelobba/TEMU-VTOFF" \
--dataset_name "dresscode" \
--dataset_root "put here your dataset path" \
--output_dir "put here the output path" \
--coarse_caption_file "qwen_captions_2_5_0_2.json" \
--phase "test" \
--order "paired" \
--height "1024" \
--width "768" \
--mask_type bounding_box \
--category "all" \
--batch_size 4 \
--mixed_precision "bf16" \
--seed 42 \
--num_workers 8 \
--fine_mask \
--guidance_scale 2.0 \
--num_inference_steps 28
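Once the paired in-shop images are generated, image quality and garment alignment can be quantified. As a hedged example (the folder layout is a placeholder and this uses torchmetrics rather than the authors' evaluation code), FID and SSIM against the ground-truth garments can be computed like this:

import torch
from pathlib import Path
from PIL import Image
from torchvision.transforms.functional import to_tensor, resize
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image import StructuralSimilarityIndexMeasure

def load(folder):
    # Load every JPEG as a float tensor in [0, 1], resized to 1024x768.
    imgs = [resize(to_tensor(Image.open(p).convert("RGB")), [1024, 768])
            for p in sorted(Path(folder).glob("*.jpg"))]
    return torch.stack(imgs)

fake, real = load("outputs"), load("ground_truth")  # placeholder paths
fid = FrechetInceptionDistance(normalize=True)      # normalize=True expects floats in [0, 1]
fid.update(real, real=True)
fid.update(fake, real=False)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
print("FID:", fid.compute().item(), "SSIM:", ssim(fake, real).item())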
Lead Authors:
- 📧 Davide Lobba: davide.lobba@unitn.it | 🎓 Google Scholar
- 📧 Fulvio Sanguigni: fulvio.sanguigni@unimore.it | 🎓 Google Scholar
For questions about the project, feel free to reach out to any of the lead authors!
Please cite our paper if you find our work helpful:
@article{lobba2025inverse,
  title={Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals},
  author={Lobba, Davide and Sanguigni, Fulvio and Ren, Bin and Cornia, Marcella and Cucchiara, Rita and Sebe, Nicu},
  journal={arXiv preprint arXiv:2505.21062},
  year={2025}
}