Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, Rang Nguyen
Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. Although significant advances have been made in image-based virtual try-on, extending these successes to video often leads to frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To tackle these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed.
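As a rough illustration of the shifted, non-overlapping chunking idea behind ShiftCaching (this sketch is not the paper's algorithm; the shift rule, chunk size, and function name are placeholders, and the caching of intermediate computation is omitted):

```python
# Rough sketch of shifted, non-overlapping chunking for long videos.
# Placeholder shift rule; not the paper's ShiftCaching implementation,
# and the caching of intermediate activations is omitted entirely.
def shifted_chunks(num_frames: int, chunk_size: int, step: int):
    """Return [start, end) frame ranges for one denoising pass.

    The chunk grid alternates between two offsets across passes, so frames
    lying on a chunk boundary in one pass fall inside a chunk in the next,
    which smooths boundary artifacts without overlapping (recomputed) frames.
    """
    offset = (step % 2) * (chunk_size // 2)
    cuts, pos = [0], offset if offset > 0 else chunk_size
    while pos < num_frames:
        cuts.append(pos)
        pos += chunk_size
    cuts.append(num_frames)
    return list(zip(cuts[:-1], cuts[1:]))

# Example: 10 frames, chunks of 4 frames.
print(shifted_chunks(10, 4, step=0))  # [(0, 4), (4, 8), (8, 10)]
print(shifted_chunks(10, 4, step=1))  # [(0, 2), (2, 6), (6, 10)]
```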
Please CITE our paper whenever this repository is used to help produce published results or incorporated into other software:
@inproceedings{nguyen2025swifttry,
title={SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models},
author={Nguyen, Hung and Nguyen, Quang Qui-Vinh and Nguyen, Khoi and Nguyen, Rang},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={6},
pages={6200--6208},
year={2025}
}
- We provide the following files and folders:
  - tools: utility code, e.g., video preprocessing and pose sequence extraction
  - src: the model's source code
  - configs: configs for training/inference
- We also provide the checkpoint at this link.
Installation
- First, create an environment with CUDA-enabled PyTorch:
conda create -n swift_try python=3.10
conda activate swift_try
- Install the remaining dependencies:
pip install -r requirements.txt
Automatic download: run the following command to download the weights automatically:
python tools/download_weights.py
The weights will be placed under the ./pretrained_sd_models directory. The whole download may take a long time.
Manual download: you can also download the weights manually in a few steps:
- Download our SwiftTry trained weights, which include four parts: denoising_unet.pth, reference_unet.pth, pose_guider.pth, and motion_module.pth.
- Download the pretrained weights of the base models and other components.
- Download the DWPose weights (dw-ll_ucoco_384.onnx, yolox_l.onnx) following this link.
Finally, these weights should be organized as follows:
./pretrained_sd_models/
|-- DWPose
| |-- dw-ll_ucoco_384.onnx
| `-- yolox_l.onnx
|-- image_encoder
| |-- config.json
| `-- pytorch_model.bin
|-- sd-vae-ft-mse
| |-- config.json
| |-- diffusion_pytorch_model.bin
| `-- diffusion_pytorch_model.safetensors
|-- swift_try
| |-- denoising_unet.pth
| |-- motion_module.pth
| |-- pose_guider.pth
| |-- reference_unet.pth
`-- stable-diffusion-v1-5
|-- feature_extractor
| `-- preprocessor_config.json
|-- model_index.json
|-- unet
| |-- config.json
| `-- diffusion_pytorch_model.bin
`-- v1-inference.yaml
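After downloading (either way), a quick check such as the following can confirm the layout above. This is a convenience snippet, not a script shipped with the repository:

```python
# Verify that the expected checkpoint files from the layout above are present.
# Convenience snippet only; not part of the repository.
from pathlib import Path

ROOT = Path("./pretrained_sd_models")
EXPECTED = [
    "DWPose/dw-ll_ucoco_384.onnx",
    "DWPose/yolox_l.onnx",
    "image_encoder/pytorch_model.bin",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "swift_try/denoising_unet.pth",
    "swift_try/motion_module.pth",
    "swift_try/pose_guider.pth",
    "swift_try/reference_unet.pth",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.bin",
]

missing = [p for p in EXPECTED if not (ROOT / p).exists()]
if missing:
    print("Missing files:")
    for p in missing:
        print("  -", p)
else:
    print("All expected weights are in place.")
```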
- Run inference with inference.py, given a data directory and a txt file of test pairs:
python inference.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>
The data directory (<DATA_PATH>) should have the following structure:
DATA_DIR/
|-- videos
| |-- 00001.mp4
| |-- 00002.mp4
|-- videos_masked
| |-- 00001.mp4
| |-- 00002.mp4
|-- videos_mask
| |-- 00001.mp4
| |-- 00002.mp4
|-- videos_dwpose
| |-- 00001.mp4
| |-- 00002.mp4
|-- garments
| |-- 00001.png
| |-- 00002.png
The test_pairs file should look like:
00425.mp4 00425.png
00426.mp4 00426.png
...
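If your videos and garments share the same base names, as in the structure above, a test_pairs file can be generated with a small helper like this (hypothetical convenience code, not part of the repository):

```python
# Generate a test_pairs file by pairing videos with same-named garment images.
# Hypothetical helper; assumes matching base names as in the structure above.
from pathlib import Path

data_dir = Path("DATA_DIR")  # replace with your <DATA_PATH>

lines = []
for video in sorted((data_dir / "videos").glob("*.mp4")):
    garment = data_dir / "garments" / f"{video.stem}.png"
    if garment.exists():
        lines.append(f"{video.name} {garment.name}")

(data_dir / "test_pairs.txt").write_text("\n".join(lines) + "\n")
print(f"Wrote {len(lines)} pairs to {data_dir / 'test_pairs.txt'}")
```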
Evaluate on TikTokDress dataset
python evaluate_tiktokdress.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>
Evaluate on VVT dataset
python evaluate_vvt.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>
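The scripts above compute the metrics reported in the paper. For a quick, informal check of a single result, you can also compare generated frames to the ground-truth frames directly, e.g. with PSNR/SSIM from scikit-image (a generic snippet, not the repository's evaluation code; file names are placeholders):

```python
# Quick, informal frame-wise comparison of a generated result against ground
# truth using PSNR/SSIM. Generic snippet; not the repository's evaluation code.
import cv2
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def read_frames(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

gen = read_frames("result.mp4")     # generated try-on video (placeholder path)
ref = read_frames("reference.mp4")  # ground-truth video (placeholder path)

psnrs, ssims = [], []
for g, r in zip(gen, ref):
    psnrs.append(peak_signal_noise_ratio(r, g))
    ssims.append(structural_similarity(r, g, channel_axis=-1))

print(f"PSNR: {np.mean(psnrs):.2f}  SSIM: {np.mean(ssims):.4f}")
```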
To download the TikTokDress Dataset, please follow the instructions on this 🤗 Hugging Face Datasets page.
By downloading these datasets, USER agrees:
- to use these datasets for research or educational purposes only
- to not distribute the datasets or part of the datasets in any original or modified form.
- and to cite our paper whenever these datasets are employed to help produce published results.
Training
Note: package dependencies have been updated; you may need to upgrade your environment via pip install -r requirements.txt before training.
Extract DWPose keypoints from raw videos:
python tools/extract_dwpose_from_vid.py --video_root /path/to/your/video_dir
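To preprocess several video folders, the same tool can be invoked in a loop, for example (hypothetical wrapper around the documented command; only the --video_root flag is assumed):

```python
# Run the documented DWPose extraction command over several video folders.
# Hypothetical wrapper; the folder paths below are placeholders.
import subprocess

video_roots = [
    "/path/to/train_videos",
    "/path/to/test_videos",
]

for root in video_roots:
    subprocess.run(
        ["python", "tools/extract_dwpose_from_vid.py", "--video_root", root],
        check=True,
    )
```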
Put the OpenPose ControlNet weights under ./pretrained_sd_models; they are used to initialize the pose_guider.
Put sd-image-variation under ./pretrained_sd_models; it is used to initialize the UNet weights.
Run the following command to pretrain the model on an image virtual try-on dataset (e.g., VITON-HD):
accelerate launch train_tryon_stage_1.py --config configs/train/stage1.yaml
Put the pretrained motion module weights mm_sd_v15_v2.ckpt (download link) under ./pretrained_sd_models.
Specify the stage 1 training weights in the config file stage2_tiktok_sam2mask.yaml, for example:
stage1_ckpt_dir: './exp_output/stage1_1024x768_ft_upblocks_aug'
stage1_ckpt_step: 30000
Run the stage 2 training command:
accelerate launch train_tryon_stage_2.py --config configs/train/stage2_tiktok_sam2mask.yaml
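The motion module trained in stage 2 corresponds to the temporal attention layers mentioned in the abstract. The sketch below illustrates the general pattern such layers follow, i.e. self-attention along the frame axis at each spatial location; it is a simplified illustration, not the repository's module, and all names are placeholders:

```python
# Simplified illustration of a temporal attention layer: self-attention over
# the frame axis at each spatial location. Not the repository's motion module.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width), i.e. features from
        # an image UNet applied to each frame independently.
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Fold spatial positions into the batch and attend over the frame axis.
        tokens = x.reshape(b, num_frames, c, h * w).permute(0, 3, 1, 2)
        tokens = tokens.reshape(b * h * w, num_frames, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended  # residual keeps the per-frame features
        # Restore the original (batch * frames, channels, height, width) layout.
        tokens = tokens.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1)
        return tokens.reshape(bt, c, h, w)

# Usage: features for 16 frames of a 2-video batch, 320 channels.
# x = torch.randn(2 * 16, 320, 32, 24)
# out = TemporalAttention(channels=320)(x, num_frames=16)
```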
We would like to thank the contributors to the Moore-AnimateAnyone repository for their open research and exploration.
Copyright (c) 2025 VinAI
Licensed under the 3-Clause BSD License.
You may obtain a copy of the License at
https://opensource.org/license/bsd-3-clause