Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, Rang Nguyen
Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. Although significant advances have been made in image-based virtual try-on, extending these successes to video often leads to frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To tackle these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed.
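As a rough illustration of the shifted, non-overlapping chunking idea behind ShiftCaching (this sketch is not the paper's algorithm; the shift rule, chunk size, and function name are placeholders, and the caching of intermediate computation is omitted):

```python
# Rough sketch of shifted, non-overlapping chunking for long videos.
# Placeholder shift rule; not the paper's ShiftCaching implementation,
# and the caching of intermediate activations is omitted entirely.
def shifted_chunks(num_frames: int, chunk_size: int, step: int):
    """Return [start, end) frame ranges for one denoising pass.

    The chunk grid alternates between two offsets across passes, so frames
    lying on a chunk boundary in one pass fall inside a chunk in the next,
    which smooths boundary artifacts without overlapping (recomputed) frames.
    """
    offset = (step % 2) * (chunk_size // 2)
    cuts, pos = [0], offset if offset > 0 else chunk_size
    while pos < num_frames:
        cuts.append(pos)
        pos += chunk_size
    cuts.append(num_frames)
    return list(zip(cuts[:-1], cuts[1:]))

# Example: 10 frames, chunks of 4 frames.
print(shifted_chunks(10, 4, step=0))  # [(0, 4), (4, 8), (8, 10)]
print(shifted_chunks(10, 4, step=1))  # [(0, 2), (2, 6), (6, 10)]
```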
Please CITE our paper whenever this repository is used to help produce published results or incorporated into other software:
@inproceedings{nguyen2025swifttry,
title={SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models},
author={Nguyen, Hung and Nguyen, Quang Qui-Vinh and Nguyen, Khoi and Nguyen, Rang},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={39},
number={6},
pages={6200--6208},
year={2025}
}
- We provide the following files and folders:
  - tools: utility code, e.g., video preprocessing and pose sequence extraction
  - src: the model's source code
  - configs: configs for training/inference
- We also provide the checkpoint at this link.
Installation
- First, create an environment with CUDA-enabled PyTorch:
conda create -n swift_try python=3.10
conda activate swift_try
- Install the remaining dependencies:
pip install -r requirements.txt
Automatic download: run the following command to download the weights automatically:
python tools/download_weights.py
The weights will be placed under the ./pretrained_sd_models directory. The whole download may take a long time.
Manual download: you can also download the weights manually in a few steps:
- Download our SwiftTry trained weights, which include four parts: denoising_unet.pth, reference_unet.pth, pose_guider.pth, and motion_module.pth.
- Download the pretrained weights of the base models and other components.
- Download the DWPose weights (dw-ll_ucoco_384.onnx, yolox_l.onnx) following this link.
Finally, these weights should be organized as follows:
./pretrained_sd_models/
|-- DWPose
| |-- dw-ll_ucoco_384.onnx
| `-- yolox_l.onnx
|-- image_encoder
| |-- config.json
| `-- pytorch_model.bin
|-- sd-vae-ft-mse
| |-- config.json
| |-- diffusion_pytorch_model.bin
| `-- diffusion_pytorch_model.safetensors
|-- swift_try
| |-- denoising_unet.pth
| |-- motion_module.pth
| |-- pose_guider.pth
| |-- reference_unet.pth
`-- stable-diffusion-v1-5
|-- feature_extractor
| `-- preprocessor_config.json
|-- model_index.json
|-- unet
| |-- config.json
| `-- diffusion_pytorch_model.bin
`-- v1-inference.yaml
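After downloading (either way), a quick check such as the following can confirm the layout above. This is a convenience snippet, not a script shipped with the repository:

```python
# Verify that the expected checkpoint files from the layout above are present.
# Convenience snippet only; not part of the repository.
from pathlib import Path

ROOT = Path("./pretrained_sd_models")
EXPECTED = [
    "DWPose/dw-ll_ucoco_384.onnx",
    "DWPose/yolox_l.onnx",
    "image_encoder/pytorch_model.bin",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "swift_try/denoising_unet.pth",
    "swift_try/motion_module.pth",
    "swift_try/pose_guider.pth",
    "swift_try/reference_unet.pth",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.bin",
]

missing = [p for p in EXPECTED if not (ROOT / p).exists()]
if missing:
    print("Missing files:")
    for p in missing:
        print("  -", p)
else:
    print("All expected weights are in place.")
```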
- Run inference with inference.py, given a data directory and a txt file of test pairs:
python inference.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>
The data directory (<DATA_PATH>) should have the following structure:
DATA_DIR/
|-- videos
| |-- 00001.mp4
| |-- 00002.mp4
|-- videos_masked
| |-- 00001.mp4
| |-- 00002.mp4
|-- videos_mask
| |-- 00001.mp4
| |-- 00002.mp4
|-- videos_dwpose
| |-- 00001.mp4
| |-- 00002.mp4
|-- garments
| |-- 00001.png
| |-- 00002.png
The test_pairs file should look like:
00425.mp4 00425.png
00426.mp4 00426.png
...
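If your videos and garments share the same base names, as in the structure above, a test_pairs file can be generated with a small helper like this (hypothetical convenience code, not part of the repository):

```python
# Generate a test_pairs file by pairing videos with same-named garment images.
# Hypothetical helper; assumes matching base names as in the structure above.
from pathlib import Path

data_dir = Path("DATA_DIR")  # replace with your <DATA_PATH>

lines = []
for video in sorted((data_dir / "videos").glob("*.mp4")):
    garment = data_dir / "garments" / f"{video.stem}.png"
    if garment.exists():
        lines.append(f"{video.name} {garment.name}")

(data_dir / "test_pairs.txt").write_text("\n".join(lines) + "\n")
print(f"Wrote {len(lines)} pairs to {data_dir / 'test_pairs.txt'}")
```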
Evaluate on TikTokDress dataset
python evaluate_tiktokdress.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>
Evaluate on VVT dataset
python evaluate_vvt.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>
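The scripts above compute the metrics reported in the paper. For a quick, informal check of a single result, you can also compare generated frames to the ground-truth frames directly, e.g. with PSNR/SSIM from scikit-image (a generic snippet, not the repository's evaluation code; file names are placeholders):

```python
# Quick, informal frame-wise comparison of a generated result against ground
# truth using PSNR/SSIM. Generic snippet; not the repository's evaluation code.
import cv2
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def read_frames(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

gen = read_frames("result.mp4")     # generated try-on video (placeholder path)
ref = read_frames("reference.mp4")  # ground-truth video (placeholder path)

psnrs, ssims = [], []
for g, r in zip(gen, ref):
    psnrs.append(peak_signal_noise_ratio(r, g))
    ssims.append(structural_similarity(r, g, channel_axis=-1))

print(f"PSNR: {np.mean(psnrs):.2f}  SSIM: {np.mean(ssims):.4f}")
```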
To download the TikTokDress Dataset, please follow the instructions on this 🤗 Hugging Face Datasets page.
By downloading these datasets, USER agrees:
- to use these datasets for research or educational purposes only
- to not distribute the datasets or part of the datasets in any original or modified form.
- and to cite our paper whenever these datasets are employed to help produce published results.
Training
Note: package dependencies have been updated; you may need to upgrade your environment via pip install -r requirements.txt before training.
Extract DWPose keypoints from raw videos:
python tools/extract_dwpose_from_vid.py --video_root /path/to/your/video_dir
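To preprocess several video folders, the same tool can be invoked in a loop, for example (hypothetical wrapper around the documented command; only the --video_root flag is assumed):

```python
# Run the documented DWPose extraction command over several video folders.
# Hypothetical wrapper; the folder paths below are placeholders.
import subprocess

video_roots = [
    "/path/to/train_videos",
    "/path/to/test_videos",
]

for root in video_roots:
    subprocess.run(
        ["python", "tools/extract_dwpose_from_vid.py", "--video_root", root],
        check=True,
    )
```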
Put the OpenPose ControlNet weights under ./pretrained_sd_models; they are used to initialize the pose_guider.
Put sd-image-variation under ./pretrained_sd_models; it is used to initialize the UNet weights.
Run the following command to pretrain the model on an image virtual try-on dataset (e.g., VITON-HD):
accelerate launch train_tryon_stage_1.py --config configs/train/stage1.yaml
Put the pretrained motion module weights mm_sd_v15_v2.ckpt (download link) under ./pretrained_sd_models.
Specify the stage 1 training weights in the config file stage2_tiktok_sam2mask.yaml, for example:
stage1_ckpt_dir: './exp_output/stage1_1024x768_ft_upblocks_aug'
stage1_ckpt_step: 30000
Run the stage 2 training command:
accelerate launch train_tryon_stage_2.py --config configs/train/stage2_tiktok_sam2mask.yaml
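The motion module trained in stage 2 corresponds to the temporal attention layers mentioned in the abstract. The sketch below illustrates the general pattern such layers follow, i.e. self-attention along the frame axis at each spatial location; it is a simplified illustration, not the repository's module, and all names are placeholders:

```python
# Simplified illustration of a temporal attention layer: self-attention over
# the frame axis at each spatial location. Not the repository's motion module.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width), i.e. features from
        # an image UNet applied to each frame independently.
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Fold spatial positions into the batch and attend over the frame axis.
        tokens = x.reshape(b, num_frames, c, h * w).permute(0, 3, 1, 2)
        tokens = tokens.reshape(b * h * w, num_frames, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended  # residual keeps the per-frame features
        # Restore the original (batch * frames, channels, height, width) layout.
        tokens = tokens.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1)
        return tokens.reshape(bt, c, h, w)

# Usage: features for 16 frames of a 2-video batch, 320 channels.
# x = torch.randn(2 * 16, 320, 32, 24)
# out = TemporalAttention(channels=320)(x, num_frames=16)
```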
We would like to thank the contributors to the Moore-AnimateAnyone repository for their open research and exploration.
Copyright (c) 2025 VinAI
Licensed under the 3-Clause BSD License.
You may obtain a copy of the License at
https://opensource.org/license/bsd-3-clause