Authors: Sijie Xu1, Runqi Wang1,2, Wei Zhu1, Dejia Song1, Nemo Chen1, Xu Tang1, Yao Hu1
Affiliations: 1Xiaohongshu, 2ShanghaiTech University
- [2025.02.11] 🔥 Source code released
- [2025.02.10] 🔥 Pretrained models available on Hugging Face
- [2024.12.31] 🔥 Paper published on arXiv
Diffusion-based stylization of images and videos typically denoises from a specific partial-noise state rather than from pure noise. This multi-step sampling is computationally expensive, which hinders real-world deployment. Consistency models obtained through trajectory distillation offer acceleration, but existing approaches only enforce alignment between the student and the imperfect teacher at the initial step. We propose Single Trajectory Distillation (STD) with three key innovations:
- Single-trajectory distillation that starts from the specific partial-noise state used at inference
- A trajectory bank for efficient management of intermediate denoising states (see the illustrative sketch below)
- An asymmetric adversarial loss based on DINO-v2 for quality enhancement
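As a purely illustrative sketch (the class name `TrajectoryBank` and its `put`/`get` interface are assumptions, not the repository's actual implementation), a trajectory bank can be thought of as a cache of partially denoised latents keyed by sample, so distillation can resume along the same teacher trajectory without re-running earlier denoising steps:

```python
import torch

class TrajectoryBank:
    """Illustrative FIFO cache of partially denoised latents, keyed by sample id."""

    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._bank = {}  # sample_id -> (latent, timestep)

    def put(self, sample_id: str, latent: torch.Tensor, timestep: int) -> None:
        # Evict the oldest entry once the bank is full (dicts keep insertion order).
        if len(self._bank) >= self.max_size:
            self._bank.pop(next(iter(self._bank)))
        self._bank[sample_id] = (latent.detach().cpu(), timestep)

    def get(self, sample_id: str):
        # Returns (latent, timestep) for a previously stored state, or None.
        return self._bank.get(sample_id)
```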
Extensive experiments demonstrate that our method surpasses existing acceleration models (LCM, TCD, PCM, etc.) in style similarity and aesthetic metrics.
- High-quality stylization in 4-8 steps
- Unified framework for image & video processing
- Plug-and-play integration with existing SDXL pipelines
Architecture diagram showing (Left) Trajectory Bank management, (Center) Single-trajectory distillation framework, (Right) Asymmetric adversarial loss component
Visual comparison with LCM, TCD, PCM, and other baselines at NFE=8 (CFG=6)
Performance under different CFG values (2-8). Our method (red line) achieves optimal style-content balance.
pip install -r requirements.txt
# optional: pip install opencv-python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.schedulers.scheduling_tcd import TCDScheduler
from PIL import Image
device = "cuda"
std_lora_path = "weights/std/std_sdxl_i2i_eta0.75.safetensors"
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained("weights/dreamshaper_XL_v21", torch_dtype=torch.float16, variant="fp16").to(device)
# use the TCD scheduler and load the STD LoRA
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config, timestep_spacing='leading', steps_offset=1)
pipe.load_lora_weights(std_lora_path, adapter_name="std")
pipe.fuse_lora()
# load the IP-Adapter for style conditioning
pipe.load_ip_adapter("ozzygt/sdxl-ip-adapter", "", weight_name="ip-adapter_sdxl_vit-h.safetensors")
pipe.set_ip_adapter_scale(dict(down=0, mid=0, up=dict(block_0=[0, 1, 0], block_1=0)))  # inject style only into up_blocks.0.attentions.1 (the 7th attention block)
# inputs
prompt = "Stick figure abstract nostalgic style."
n_prompt = "worst face, NSFW, nudity, nipples, (worst quality, low quality:1.4), blurred, low resolution, pixelated, dull colors, overly simplistic, harsh lighting, lack of detail, poorly composed, dark and gloomy atmosphere, (malformed hands:1.4), (poorly drawn hands:1.4), (mutated fingers:1.4), (extra limbs:1.35), (poorly drawn face:1.4), missing legs, (extra legs:1.4), missing arms, extra arm, ugly, fat, (close shot:1.1), explicit content, sexual content, pornography, adult content, inappropriate, indecent, obscene, vulgar, suggestive, erotic, lewd, provocative, mature content"
src_img = Image.open("doc/imgs/src_img.jpg").resize((960, 1280))
style_img = Image.open("doc/imgs/style_img.png")
image = pipe(
    prompt=prompt,
    negative_prompt=n_prompt,
    num_inference_steps=11,  # ceil(8 / strength) = ceil(8 / 0.75) = 11, so ~8 steps actually run
    guidance_scale=6,
    strength=0.75,  # matches the eta of the loaded STD LoRA
    image=src_img,
    ip_adapter_image=style_img,
).images[0]
image.save("std.png")
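Because SDXL img2img runs only the final `strength` fraction of the noise schedule, the number of denoising steps that actually execute is roughly `num_inference_steps * strength`. The helper below (hypothetical, not part of this repository) inverts that relation to hit a target step budget:

```python
import math

def steps_for_budget(target_steps: int, strength: float) -> int:
    """Pick num_inference_steps so that ~target_steps denoising steps actually run.

    diffusers' img2img pipelines execute roughly int(num_inference_steps * strength)
    steps, so we invert the relation and round up.
    """
    return math.ceil(target_steps / strength)

print(steps_for_budget(8, 0.75))  # -> 11, matching the quickstart above
```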
We provide pretrained models for both image-to-image and video-to-video tasks with different η values. All models are hosted on Hugging Face.
Image-to-image (SDXL):

| η Value | Model Link |
|---|---|
| 0.65 | std_sdxl_i2i_eta0.65.safetensors |
| 0.75 | std_sdxl_i2i_eta0.75.safetensors |
| 0.85 | std_sdxl_i2i_eta0.85.safetensors |
| 0.95 | std_sdxl_i2i_eta0.95.safetensors |
Video-to-video (SDXL):

| η Value | Model Link |
|---|---|
| 0.65 | std_sdxl_v2v_eta0.65.safetensors |
| 0.75 | std_sdxl_v2v_eta0.75.safetensors |
| 0.85 | std_sdxl_v2v_eta0.85.safetensors |
| 0.95 | std_sdxl_v2v_eta0.95.safetensors |
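The quickstart pairs the η = 0.75 checkpoint with `strength=0.75`, which suggests picking the checkpoint whose η matches the strength you intend to use. A hedged sketch of swapping checkpoints in the pipeline built above (the step count follows the same `ceil(8 / strength)` rule):

```python
# Swap in the eta = 0.85 image-to-image checkpoint.
pipe.unfuse_lora()          # undo the previously fused STD LoRA
pipe.unload_lora_weights()  # and drop its weights
pipe.load_lora_weights("weights/std/std_sdxl_i2i_eta0.85.safetensors", adapter_name="std")
pipe.fuse_lora()

image = pipe(
    prompt=prompt,
    negative_prompt=n_prompt,
    num_inference_steps=10,  # ceil(8 / 0.85) = 10
    guidance_scale=6,
    strength=0.85,  # match the checkpoint's eta
    image=src_img,
    ip_adapter_image=style_img,
).images[0]
```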
Download the Open-Sora-Plan-v1 dataset from Hugging Face, which is split from Panda70M.
weights/
├── dinov2_vits14_pretrain.pth
├── ipadapter/
├── motion_adapter_hsxl/
├── open_clip_pytorch_model.bin
└── sdxl_base1.0/
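If you prefer fetching checkpoints programmatically, a sketch with `huggingface_hub` is below; `your-org/std-weights` is a placeholder repo id, since this README does not name the exact repositories:

```python
from huggingface_hub import hf_hub_download

# NOTE: "your-org/std-weights" is a placeholder repo id, not a real repository.
lora_path = hf_hub_download(
    repo_id="your-org/std-weights",
    filename="std_sdxl_i2i_eta0.75.safetensors",
    local_dir="weights/std",
)
print(lora_path)
```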
bash scripts/std_sdxl_i2i.sh
@article{xu2024single,
title={Single Trajectory Distillation for Accelerating Image and Video Style Transfer},
author={Xu, Sijie and Wang, Runqi and Zhu, Wei and Song, Dejia and Chen, Nemo and Tang, Xu and Hu, Yao},
journal={arXiv preprint arXiv:2412.18945},
year={2024}
}
This work builds upon MCM. We thank the open-source community for their valuable contributions.