This is the official implementation of DRA-Ctrl.
By Hengyuan Cao, Yutong Feng, Biao Gong, Yijing Tian, Yunhong Lu, Chuang Liu, and Bin Wang
[2025-07-01] Added a new Gradio app (gradio_app_hf.py) designed similarly to our HuggingFace Space, making it easier to switch tasks, adjust parameters, and directly test examples. The previous Gradio app (gradio_app.py) will remain unchanged.
- Release code
- Release checkpoints
- Use a quantized version to save VRAM
- Use FramePack as the base model
Video generative models can be regarded as world simulators due to their ability to capture the dynamic, continuous changes inherent in real-world environments. These models integrate high-dimensional information across visual, temporal, spatial, and causal dimensions, enabling predictions of subjects in various states. A natural and valuable research direction is to explore whether a fully trained video generative model in high-dimensional space can effectively support lower-dimensional tasks such as controllable image generation. In this work, we propose a paradigm for video-to-image knowledge compression and task adaptation, termed Dimension-Reduction Attack (DRA-Ctrl), which leverages the strengths of video models, including long-range context modeling and flattened full attention, to perform various generation tasks. Specifically, to bridge the challenging gap between continuous video frames and discrete image generation, we introduce a mixup-based transition strategy that ensures smooth adaptation. Moreover, we redesign the attention structure with a tailored masking mechanism to better align text prompts with image-level control. Experiments across diverse image generation tasks, such as subject-driven and spatially conditioned generation, show that repurposed video models outperform those trained directly on images. These results highlight the untapped potential of large-scale video generators for broader visual applications. DRA-Ctrl provides new insights into reusing resource-intensive video models and lays a foundation for future unified generative models across visual modalities.
Our method is implemented on Linux with an H800 80GB GPU. Peak VRAM consumption stays below 45GB.
conda create --name dra_ctrl python=3.12
conda activate dra_ctrl
pip install -r requirements.txt
We use the community fork of tencent/HunyuanVideo-I2V with Diffusers-format weights as the initialization parameters for the model.
You can download the LoRA weights for the various DRA-Ctrl tasks at this link.
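If you prefer to fetch the weights programmatically, here is a minimal sketch using huggingface_hub. The base-model repo id is assumed to be the community Diffusers-format fork, and the DRA-Ctrl repo id is a hypothetical placeholder for the link above.

```python
# Sketch: download the base model and one task LoRA with huggingface_hub.
# "hunyuanvideo-community/HunyuanVideo-I2V" is assumed to be the community
# Diffusers-format fork; "YOUR_ORG/DRA-Ctrl" is a hypothetical placeholder
# for the repo behind the link above.
from huggingface_hub import hf_hub_download, snapshot_download

# Base model in Diffusers format.
snapshot_download(
    repo_id="hunyuanvideo-community/HunyuanVideo-I2V",
    local_dir="ckpts/HunyuanVideo-I2V",
)

# One task-specific LoRA, e.g. subject-driven generation.
hf_hub_download(
    repo_id="YOUR_ORG/DRA-Ctrl",
    filename="subject_driven.safetensors",
    local_dir="ckpts",
)
```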
The checkpoint directory is shown below.
DRA-Ctrl/
└── ckpts/
    ├── HunyuanVideo-I2V/
    │   ├── image_processor/
    │   ├── scheduler/
    │   └── ...
    ├── depth-anything-small-hf/
    │   ├── model.safetensors
    │   └── ...
    ├── canny.safetensors
    ├── coloring.safetensors
    ├── deblurring.safetensors
    ├── depth.safetensors
    ├── depth_pred.safetensors
    ├── fill.safetensors
    ├── sr.safetensors
    ├── subject_driven.safetensors
    └── style_transfer.safetensors
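For a sense of how the pieces fit together, the sketch below loads the base model and one task LoRA through the standard diffusers API. This is illustrative only: the supported entry points are the Gradio apps below, and the actual pipeline wiring in this repo may differ.

```python
# Illustrative sketch only: load the Diffusers-format base model plus a task
# LoRA via the standard diffusers API. The repo's own Gradio apps are the
# supported entry points and may wire the pipeline differently.
import torch
from diffusers import HunyuanVideoImageToVideoPipeline

pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    "ckpts/HunyuanVideo-I2V", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("ckpts", weight_name="depth.safetensors")  # pick a task LoRA
pipe.enable_model_cpu_offload()  # trades speed for lower peak VRAM
```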
python gradio_app.py --config configs/gradio.yaml
For easier switching between tasks, adjusting parameters, and testing examples, please use
python gradio_app_hf.py
For easier testing in spatially-aligned image generation tasks, there is no need to manually supply edge maps, depth maps, or other condition images when passing the condition image to gradio_app; only the original image is required, and the corresponding condition images are extracted automatically.
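For reference, this extraction can be approximated along the following lines. This is a sketch assuming OpenCV for Canny edges and the bundled depth-anything-small-hf checkpoint via the transformers depth-estimation pipeline; the input file name is a hypothetical example.

```python
# Sketch: derive condition images from an original image. Assumes OpenCV for
# Canny edges and the bundled Depth-Anything checkpoint via transformers;
# the input file name is a hypothetical example from assets/.
import cv2
import numpy as np
from PIL import Image
from transformers import pipeline

image = Image.open("assets/canny_test.jpg").convert("RGB")

# Canny edge map (OpenCV expects an 8-bit, single-channel input).
gray = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY)
Image.fromarray(cv2.Canny(gray, 100, 200)).save("canny_condition.png")

# Depth map from the local depth-anything-small-hf checkpoint.
depth_estimator = pipeline("depth-estimation", model="ckpts/depth-anything-small-hf")
depth_estimator(image)["depth"].save("depth_condition.png")
```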
You can use the *_test.jpg or *_test.png images from the assets folder as condition images for input to gradio_app, which will generate the examples below.
Examples:
If you find our work helpful, please cite:
@misc{cao2025dimensionreductionattackvideogenerative,
      title={Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis},
      author={Hengyuan Cao and Yutong Feng and Biao Gong and Yijing Tian and Yunhong Lu and Chuang Liu and Bin Wang},
      year={2025},
      eprint={2505.23325},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23325},
}
This project uses code from the following sources:
- diffusers/models/transformers/transformer_hunyuan_video - Copyright 2024 The HunyuanVideo Team and The HuggingFace Team (Apache 2.0 licensed).
- diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video_image2video - Copyright 2024 The HunyuanVideo Team and The HuggingFace Team (Apache 2.0 licensed).
We would like to thank the contributors to the HunyuanVideo, HunyuanVideo-I2V, diffusers, and HuggingFace repositories for their open research and exploration.