
Single Trajectory Distillation for Accelerating Image and Video Style Transfer

Links: arXiv · Hugging Face Models · GitHub · Project Page

Authors: Sijie Xu¹, Runqi Wang¹,², Wei Zhu¹, Dejia Song¹, Nemo Chen¹, Xu Tang¹, Yao Hu¹
Affiliations: ¹Xiaohongshu, ²ShanghaiTech University

📢 News

📖 Abstract

Diffusion-based stylization methods typically denoise from a specific partial noise state in image and video tasks. This multi-step process is computationally expensive, which hinders real-world deployment. While consistency models trained via trajectory distillation offer a path to acceleration, existing approaches enforce alignment between the student and the (imperfect) teacher only at the initial step. We propose Single Trajectory Distillation (STD) with three key innovations:

  1. A single-trajectory distillation method that starts from a specific partial noise state
  2. A trajectory bank for efficient management of intermediate states
  3. An asymmetric adversarial loss based on DINO-v2 features for quality enhancement

Extensive experiments demonstrate that our method surpasses existing acceleration methods (LCM, TCD, PCM, etc.) in style similarity and aesthetic metrics.
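
For intuition, here is a minimal, hypothetical sketch of the trajectory-bank idea; the class and method names below are ours, not the repository's API. The bank caches each sample's most recent (timestep, latent) pair so training can resume a partially denoised trajectory instead of recomputing it from the initial partial-noise state.

import torch

class TrajectoryBank:
    """Hypothetical sketch: cache one (timestep, latent) pair per sample."""
    def __init__(self):
        self.bank = {}

    def fetch(self, sample_id, init_t, init_latent):
        # First visit: begin the trajectory at the given partial-noise state.
        if sample_id not in self.bank:
            self.bank[sample_id] = (init_t, init_latent.detach().clone())
        return self.bank[sample_id]

    def update(self, sample_id, t, latent):
        # Store the advanced state so the next pass continues this trajectory.
        self.bank[sample_id] = (t, latent.detach().clone())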

🎯 Key Features

  • High-quality stylization in 4-8 steps
  • Unified framework for image and video processing
  • Plug-and-play integration with existing SDXL pipelines

🖼️ Visual Results

Method Overview

Architecture diagram: (left) trajectory bank management; (center) the single-trajectory distillation framework; (right) the asymmetric adversarial loss component.

Qualitative Comparison

Visual comparison with LCM, TCD, PCM, and other baselines at NFE=8 (CFG=6).

Metric Analysis

Performance under different CFG values (2-8). Our method (red line) achieves the best balance between style similarity and content preservation.

🚀 Quick Start

Installation

pip install -r requirements.txt

Inference Demo (Image-to-Image)

# !pip install opencv-python
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.schedulers.scheduling_tcd import TCDScheduler
from PIL import Image
 
device = "cuda"
std_lora_path = "weights/std/std_sdxl_i2i_eta0.75.safetensors"
 
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained("weights/dreamshaper_XL_v21", torch_dtype=torch.float16, variant="fp16").to(device)
 
# swap in the TCD scheduler used by STD
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config, timestep_spacing='leading', steps_offset=1)
# load and fuse the STD LoRA
pipe.load_lora_weights(std_lora_path, adapter_name="std")
pipe.fuse_lora()
 
# load ipadapter
pipe.load_ip_adapter("ozzygt/sdxl-ip-adapter", "", weight_name="ip-adapter_sdxl_vit-h.safetensors")
pipe.set_ip_adapter_scale(dict(down=0, mid=0, up=dict(block_0=[0, 1, 0], block_1=0))) # apply style only on the second attention layer of up block_0; all other blocks stay at 0
 
# inputs
prompt = "Stick figure abstract nostalgic style."
n_prompt = "worst face, NSFW, nudity, nipples, (worst quality, low quality:1.4), blurred, low resolution, pixelated, dull colors, overly simplistic, harsh lighting, lack of detail, poorly composed, dark and gloomy atmosphere, (malformed hands:1.4), (poorly drawn hands:1.4), (mutated fingers:1.4), (extra limbs:1.35), (poorly  drawn face:1.4), missing legs, (extra legs:1.4), missing arms, extra arm, ugly, fat, (close shot:1.1), explicit content, sexual content, pornography, adult content, inappropriate, indecent, obscene, vulgar, suggestive, erotic, lewd, provocative, mature content"
src_img = Image.open("doc/imgs/src_img.jpg").resize((960, 1280))
style_img = Image.open("doc/imgs/style_img.png")
 
image = pipe(
    prompt=prompt,
    negative_prompt=n_prompt,
    num_inference_steps=11, # ceil(8 / 0.75) = 11: with strength=0.75, only ~8 steps actually run
    guidance_scale=6,
    strength=0.75,
    image=src_img,
    ip_adapter_image=style_img,
).images[0]
 
image.save("std.png")
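
Note the step count: with strength s, diffusers' img2img pipelines skip the first (1 - s) fraction of the schedule, so roughly num_inference_steps × s denoising steps actually execute. A small helper (ours, for illustration) recovers the scheduler setting from a target NFE:

import math

def steps_for_nfe(nfe, strength):
    # img2img executes about int(num_inference_steps * strength) steps,
    # so request ceil(nfe / strength) scheduler steps to realize `nfe`.
    return math.ceil(nfe / strength)

assert steps_for_nfe(8, 0.75) == 11  # matches num_inference_steps in the demo above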

📦 Model Zoo

We provide pretrained models for both image-to-image and video-to-video tasks with different η values. All models are hosted on Hugging Face.

Image-to-Image Models

η Value   Model Link
0.65      std_sdxl_i2i_eta0.65.safetensors
0.75      std_sdxl_i2i_eta0.75.safetensors
0.85      std_sdxl_i2i_eta0.85.safetensors
0.95      std_sdxl_i2i_eta0.95.safetensors

Video-to-Video Models

η Value   Model Link
0.65      std_sdxl_v2v_eta0.65.safetensors
0.75      std_sdxl_v2v_eta0.75.safetensors
0.85      std_sdxl_v2v_eta0.85.safetensors
0.95      std_sdxl_v2v_eta0.95.safetensors
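
To fetch a checkpoint programmatically, huggingface_hub's hf_hub_download works; the repo id below is an assumption for illustration, so substitute the repository that actually hosts the weights:

from huggingface_hub import hf_hub_download

# NOTE: assumed repo id; replace with the actual Hugging Face repository.
std_lora_path = hf_hub_download(
    repo_id="dynamic-X-LAB/Single-Trajectory-Distillation",
    filename="std_sdxl_i2i_eta0.75.safetensors",
)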

🛠️ Training

Data Preparation

Download the Open-Sora-Plan-v1 dataset from Hugging Face, which is split from Panda70M.
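
One possible way to fetch it (the dataset repo id below is an assumption; point it at the actual Open-Sora-Plan-v1 repository):

from huggingface_hub import snapshot_download

# NOTE: assumed dataset repo id, for illustration only.
snapshot_download(
    repo_id="LanguageBind/Open-Sora-Plan-v1.0.0",
    repo_type="dataset",
    local_dir="data/open_sora_plan_v1",
)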

Pretrained Weights Layout

Arrange the required pretrained weights as follows:

weights/
├── dinov2_vits14_pretrain.pth
├── ipadapter/
├── motion_adapter_hsxl/
├── open_clip_pytorch_model.bin
└── sdxl_base1.0/
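
As a hedged example of wiring up one of these checkpoints: the DINO-v2 ViT-S/14 backbone (the feature extractor behind the asymmetric adversarial loss) can be built from the official torch.hub entry and filled with the local weights above; the repository's actual loading code may differ.

import torch

# Build the ViT-S/14 architecture from the official DINOv2 hub entry,
# then load the local pretrained checkpoint listed in the tree above.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14", pretrained=False)
state_dict = torch.load("weights/dinov2_vits14_pretrain.pth", map_location="cpu")
dinov2.load_state_dict(state_dict)
dinov2.eval()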

Training Command

bash scripts/std_sdxl_i2i.sh

📚 Citation

@article{xu2024single,
  title={Single Trajectory Distillation for Accelerating Image and Video Style Transfer},
  author={Xu, Sijie and Wang, Runqi and Zhu, Wei and Song, Dejia and Chen, Nemo and Tang, Xu and Hu, Yao},
  journal={arXiv preprint arXiv:2412.18945},
  year={2024}
}

🙏 Acknowledgement

This work builds upon MCM. We thank the open-source community for their valuable contributions.
