
SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

AAAI 2025

Hung Nguyen    Quang (Qui-Vinh) Nguyen    Khoi Nguyen    Rang Nguyen




Abstract

    Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. Although significant advances have been made in image-based virtual try-on, extending these successes to video often leads to frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To tackle these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed.
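
For intuition only, below is a minimal sketch of the chunk-scheduling idea behind ShiftCaching, not the paper's implementation: rather than denoising heavily overlapping chunks at every step, the chunk boundaries are shifted from one denoising step to the next, so boundary frames are still covered without repeatedly re-processing shared frames. All names and the shift schedule here are hypothetical, and the caching/reuse of computation is omitted.

# Illustrative sketch of shifted chunking (hypothetical; the real ShiftCaching
# additionally caches and reuses computation, which is omitted here).

def overlapping_chunks(num_frames, chunk_size, overlap):
    # Classic scheme: neighbouring chunks share `overlap` frames, so those
    # frames are denoised more than once per step.
    stride = chunk_size - overlap
    return [list(range(s, min(s + chunk_size, num_frames)))
            for s in range(0, num_frames - overlap, stride)]

def shifted_chunks(num_frames, chunk_size, step):
    # Shifted scheme: non-overlapping chunks whose boundaries move with the
    # denoising step, so each frame is processed exactly once per step while
    # the boundary positions vary over time.
    shift = (step * chunk_size // 2) % chunk_size  # hypothetical shift schedule
    starts = list(range(shift, num_frames, chunk_size))
    if not starts or starts[0] != 0:
        starts = [0] + starts
    ends = starts[1:] + [num_frames]
    return [list(range(s, e)) for s, e in zip(starts, ends)]

if __name__ == "__main__":
    n, c = 24, 8
    print("overlapping:", overlapping_chunks(n, c, overlap=4))
    for t in range(2):
        print(f"shifted, step {t}:", shifted_chunks(n, c, step=t))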

✨ Citation

Please CITE our paper whenever this repository is used to help produce published results or incorporated into other software:

@inproceedings{nguyen2025swifttry,
  title={SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models},
  author={Nguyen, Hung and Nguyen, Quang Qui-Vinh and Nguyen, Khoi and Nguyen, Rang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={6},
  pages={6200--6208},
  year={2025}
}

Project structure

  • We provide the following files and folders:
    • tools: utility scripts, e.g. for preprocessing videos and extracting pose sequences
    • src: model source code
    • configs: configuration files for training and inference
  • We also provide the checkpoints on this HuggingFace page.

Training

Installation

Environment

  • First, create an environment for CUDA-enabled PyTorch:
    conda create -n swift_try python=3.10
    conda activate swift_try
  • Install the remaining dependencies:
    pip install -r requirements.txt
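
Optionally, run a quick sanity check (a generic snippet, not part of this repository) to confirm that the environment sees a CUDA-capable PyTorch build:

# Generic environment check (not part of this repository).
import torch

print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU           :", torch.cuda.get_device_name(0))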

Download Pretrained Models

Automatic download: run the following command to download the weights automatically:

python tools/download_weights.py

Weights will be placed under the ./pretrained_sd_models directory. The full download may take a while.

Manual download: you can also download the weights manually, following these steps:

  1. Download our SwiftTry trained weights, which include four parts: denoising_unet.pth, reference_unet.pth, pose_guider.pth and motion_module.pth.

  2. Download the pretrained weights of the base models and other components:

  3. Download dwpose weights (dw-ll_ucoco_384.onnx, yolox_l.onnx) following this.

Finally, these weights should be organized as follows:

./pretrained_sd_models/
|-- DWPose
|   |-- dw-ll_ucoco_384.onnx
|   `-- yolox_l.onnx
|-- image_encoder
|   |-- config.json
|   `-- pytorch_model.bin
|-- sd-vae-ft-mse
|   |-- config.json
|   |-- diffusion_pytorch_model.bin
|   `-- diffusion_pytorch_model.safetensors
|-- swift_try
|   |-- denoising_unet.pth
|   |-- motion_module.pth
|   |-- pose_guider.pth
|   |-- reference_unet.pth
`-- stable-diffusion-v1-5
    |-- feature_extractor
    |   `-- preprocessor_config.json
    |-- model_index.json
    |-- unet
    |   |-- config.json
    |   `-- diffusion_pytorch_model.bin
    `-- v1-inference.yaml
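
After downloading, a short script such as the following (a hypothetical helper, not included in the repository) can confirm that the expected files from the layout above are in place:

# Hypothetical helper: verify the ./pretrained_sd_models layout shown above.
from pathlib import Path

EXPECTED = [
    "DWPose/dw-ll_ucoco_384.onnx",
    "DWPose/yolox_l.onnx",
    "image_encoder/config.json",
    "image_encoder/pytorch_model.bin",
    "sd-vae-ft-mse/config.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "swift_try/denoising_unet.pth",
    "swift_try/motion_module.pth",
    "swift_try/pose_guider.pth",
    "swift_try/reference_unet.pth",
    "stable-diffusion-v1-5/model_index.json",
    "stable-diffusion-v1-5/unet/config.json",
    "stable-diffusion-v1-5/unet/diffusion_pytorch_model.bin",
    "stable-diffusion-v1-5/v1-inference.yaml",
]

root = Path("./pretrained_sd_models")
missing = [p for p in EXPECTED if not (root / p).exists()]
print("All weights found." if not missing else f"Missing files: {missing}")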

Running

Inference

  1. Run inference on a data directory and a test-pairs file using inference.py:
    python inference.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>

The directory passed to --data_dir should have the following structure:

DATA_DIR/
|-- videos
|   |-- 00001.mp4
|   |-- 00002.mp4
|-- videos_masked
|   |-- 00001.mp4
|   |-- 00002.mp4
|-- videos_mask
|   |-- 00001.mp4
|   |-- 00002.mp4
|-- videos_dwpose
|   |-- 00001.mp4
|   |-- 00002.mp4
|-- garments
|   |-- 00001.png
|   |-- 00002.png

The test_pairs file should list one video/garment pair per line:

00425.mp4 00425.png
00426.mp4 00426.png
...
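
Before launching inference, a small check (hypothetical snippet, based only on the layout above) can catch missing files:

# Hypothetical check that every pair listed in test_pairs exists under the data dir.
from pathlib import Path

data_dir = Path("DATA_DIR")            # replace with your --data_dir
pairs_file = Path("test_pairs.txt")    # replace with your --test_pairs

for line in pairs_file.read_text().splitlines():
    if not line.strip():
        continue
    video_name, garment_name = line.split()
    for sub in ("videos", "videos_masked", "videos_mask", "videos_dwpose"):
        path = data_dir / sub / video_name
        if not path.exists():
            print(f"missing video: {path}")
    garment = data_dir / "garments" / garment_name
    if not garment.exists():
        print(f"missing garment: {garment}")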

Evaluate

Evaluate on TikTokDress dataset

    python evaluate_tiktokdress.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>

Evaluate on VVT dataset

    python evaluate_vvt.py --data_dir <DATA_PATH> --test_pairs <TEST_PAIRS_PATH> --save_dir <SAVE_DIR>

TikTokDressDataset

To download the TikTokDress dataset, please follow the instructions on this 🤗 HuggingFace Datasets page.

By downloading these datasets, USER agrees:

  • to use these datasets for research or educational purposes only;
  • not to distribute the datasets, in whole or in part, in any original or modified form;
  • and to cite our paper whenever these datasets are employed to help produce published results.

Training

Note: the package dependencies have been updated; please upgrade your environment via pip install -r requirements.txt before training.

Data Preparation

Extract DWPose keypoints from raw videos:

python tools/extract_dwpose_from_vid.py --video_root /path/to/your/video_dir
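
To see which files the script will pick up, you can list the videos under --video_root first (a generic snippet; the exact formats the tool accepts may differ):

# Generic listing of input videos (assumes .mp4 files directly under video_root).
from pathlib import Path

video_root = Path("/path/to/your/video_dir")
videos = sorted(video_root.glob("*.mp4"))
print(f"{len(videos)} videos found, e.g.:", [v.name for v in videos[:5]])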

Stage1

Put the openpose ControlNet weights under ./pretrained_sd_models; they are used to initialize the pose_guider.

Put sd-image-variation under ./pretrained_sd_models; it is used to initialize the UNet weights.

Run command:

accelerate launch train_tryon_stage_1.py --config configs/train/stage1.yaml

to pretrain the model on an image virtual try-on dataset (e.g., VITON-HD).

Stage2

Put the pretrained motion module weights mm_sd_v15_v2.ckpt (download link) under ./pretrained_sd_models.

Specify the stage1 training weights in the config file stage2_tiktok_sam2mask.yaml, for example:

stage1_ckpt_dir: './exp_output/stage1_1024x768_ft_upblocks_aug'
stage1_ckpt_step: 30000 
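
Assuming the training configs are plain YAML, a quick check (hypothetical snippet, not part of the repository) that the referenced stage-1 checkpoint directory exists:

# Hypothetical sanity check on the stage-2 config (assumes plain YAML with the
# keys shown above; the repository may load configs differently).
from pathlib import Path
import yaml

with open("configs/train/stage2_tiktok_sam2mask.yaml") as f:
    cfg = yaml.safe_load(f)

ckpt_dir = Path(cfg["stage1_ckpt_dir"])
print("stage1_ckpt_dir exists:", ckpt_dir.is_dir())
print("stage1_ckpt_step      :", cfg["stage1_ckpt_step"])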

Run command:

accelerate launch train_tryon_stage_2.py --config configs/train/stage2_tiktok_sam2mask.yaml

Acknowledgments

We would like to thank the contributors to the Moore-AnimateAnyone repository for their open research and exploration.

License

Copyright (c) 2025 VinAI
Licensed under the 3-Clause BSD License.
You may obtain a copy of the License at
    https://opensource.org/license/bsd-3-clause
