Aowen Wang¹, Wei Li¹, Hao Luo¹ ², Mengxing Ao¹, Fan Wang¹
¹DAMO Academy, Alibaba Group ²Hupan Lab
JCo-MVTON is a mask-free virtual try-on framework built on MM-DiT. It addresses three key limitations of existing systems: rigid dependence on human-body masks, limited fine-grained control over garment attributes, and poor generalization to in-the-wild scenarios.
git clone https://github.com/damo-cv/JCo-MVTON.git
cd JCo-MVTON
conda create -n jco-mvton python=3.10
conda activate jco-mvton
pip install -r requirements.txt
# Install a patched diffusers that supports the extra condition branches
git clone https://github.com/huggingface/diffusers.git
cd diffusers
git checkout v0.33.0
cp ../flux/modeling_utils.py src/diffusers/models/
pip install .
cd ..
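After installation, a quick Python sanity check confirms the patched build is on your path (the expected version string assumes the v0.33.0 checkout above):

import diffusers
from diffusers import FluxPipeline, FluxTransformer2DModel  # should import cleanly

print(diffusers.__version__)  # expect 0.33.0 from the checkout above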
# Download the model checkpoints into ckpts/, where the inference code below expects them
wget -P ckpts/ https://huggingface.co/Damo-vision/JCo-MVTON/resolve/main/try_on_upper.pt
wget -P ckpts/ https://huggingface.co/Damo-vision/JCo-MVTON/resolve/main/try_on_lower.pt
wget -P ckpts/ https://huggingface.co/Damo-vision/JCo-MVTON/resolve/main/try_on_dress.pt
# Load the transformer with the extra condition branches
import torch
from PIL import Image
from diffusers import FluxPipeline, FluxTransformer2DModel

device = "cuda"
torch_dtype = torch.bfloat16  # assumed dtype; use float16 if your GPU lacks bf16 support
extra_branch_num = 2          # assumed: one branch each for the garment and person conditions
model_id = "black-forest-labs/FLUX.1-dev"
ckpt = "ckpts/try_on_upper.pt"

transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    subfolder="transformer",
    extra_branch_num=extra_branch_num,
    low_cpu_mem_usage=False,
).to(device)
transformer.load_state_dict(torch.load(ckpt, map_location="cpu")["module"], strict=False)

pipe = FluxPipeline.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    transformer=transformer,
).to(device)
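On GPUs with limited memory, diffusers' standard offloading hook can be used in place of the .to(device) call above; this is optional and unrelated to the try-on logic itself:

# Optional: keep weights on CPU and stream submodules to the GPU on demand.
pipe.enable_model_cpu_offload()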
# Load and preprocess the person and garment images
width, height = 768, 1024  # resolution is an assumption; match the checkpoint's training size
person = Image.open("assets/ref.jpg").convert("RGB").resize((width, height))
cloth = Image.open("assets/upper.jpg").convert("RGB").resize((width, height))
person_tensor = transform_person(person)
cloth_tensor = transform_cloth(cloth)
prompt = "A fashion model wearing stylish clothing, high-resolution 8k, detailed textures, realistic lighting, fashion photography style."
# Generate the try-on image (sampling hyperparameters are assumptions; tune to taste)
seed = 42
n_steps = 30
guidance_scale = 3.5
mode = "upper"  # assumed to select the garment type matching the loaded checkpoint

with torch.inference_mode():
    generated_image = pipe(
        generator=torch.Generator(device="cpu").manual_seed(seed),
        prompt=prompt,
        num_inference_steps=n_steps,
        guidance_scale=guidance_scale,
        height=height,
        width=width,
        cloth_img=cloth_tensor,
        person_img=person_tensor,
        extra_branch_num=extra_branch_num,
        mode=mode,
        max_sequence_length=77,
    ).images[0]
# Save a side-by-side strip: garment | person | result
import torchvision.utils as vutils

person_tensor = transform_output(person)
cloth_tensor = transform_output(cloth)
generated_tensor = transform_output(generated_image)
concatenated_tensor = torch.cat((cloth_tensor, person_tensor, generated_tensor), dim=2)  # concat along width
vutils.save_image(concatenated_tensor, "output.png")
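If only the try-on result is needed, the pipeline already returns a PIL image, so it can be saved directly without the tensor round-trip:

generated_image.save("result.png")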
JCo-MVTON achieves state-of-the-art performance, leading on five of the six reported metrics:
| Methods | SSIM ↑ (Paired) | FID ↓ (Paired) | KID ↓ (Paired) | LPIPS ↓ (Paired) | FID ↓ (Unpaired) | KID ↓ (Unpaired) |
|---|---|---|---|---|---|---|
| MV-VTON (Wang et al., 2025b) | 0.8083 | 15.442 | 7.501 | 0.1171 | 17.900 | 3.861 |
| OOTDiffusion (Xu et al., 2024) | 0.8187 | 9.305 | 4.086 | 0.0876 | 12.408 | 4.689 |
| JCo-MVTON (Ours) | 0.8601 | 8.103 | 2.003 | 0.0891 | 9.561 | 2.700 |
If you find our work useful, please cite:
@article{wang2024jco,
  title={JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on},
  author={Wang, Aowen and Li, Wei and Luo, Hao and Ao, Mengxing and Wang, Fan},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2024}
}
This project is released under the Apache 2.0 license.
We thank the open-source community for their valuable contributions and the reviewers for their constructive feedback. Special thanks to the DAMO Academy and Hupan Lab for supporting this research.