Official Implementation of ["Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models"]
This repository contains the official implementation of our paper *Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models*. Please follow the official HunyuanVideo and Wan 2.1 guides to set up the environment (see the Installation section below).
- 🔥 Latest News
- 📀 Installation
- 🚀 Running the Code
- 📊 Quantitative Comparison
- ⚡ Scale to Multi-GPU
- 📝 To-Do List
## 🔥 Latest News
- If you like our project, please give it a star ⭐ on GitHub to stay up to date with the latest releases.
- [2025/04/04] 🎉 Paper submitted to arXiv.
- [2025/04/04] 🔥 Released the open-source code for the latest model.
## 📀 Installation

Follow the official HunyuanVideo and Wan 2.1 environment setup guides, then install the repository's dependencies:

```bash
pip install -r requirements.txt
```
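For reference, a complete setup might look like the sketch below; the environment name and Python version are assumptions, so defer to the upstream guides for the exact requirements.

```bash
# Hypothetical environment setup -- follow the upstream guides for exact versions.
conda create -n feature-cache python=3.10 -y
conda activate feature-cache
pip install -r requirements.txt
```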
## 🚀 Running the Code

### HunyuanVideo

```bash
cd HunyuanVideo
python3 sample_video.py \
    --video-size 360 720 \
    --video-length 129 \
    --infer-steps 50 \
    --prompt "cat walk on grass" \
    --flow-reverse \
    --use-cpu-offload \
    --save-path ./results \
    --seed 42 \
    --model-base "ckpts" \
    --dit-weight "ckpts/hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt" \
    --delta_cache
```
### Wan2.1

```bash
cd Wan2.1
python generate.py \
    --task t2v-14B \
    --size 832*480 \
    --frame_num 81 \
    --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
    --delta_cache
```
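Both commands enable our profiling-based feature cache via `--delta_cache`. To convey the intuition, here is a rough conceptual sketch, not the repository's actual code: the function names, the residual-caching choice, and the threshold are all illustrative assumptions. The idea is to profile how strongly each block's output changes across diffusion steps, then reuse cached outputs for the blocks whose profiled change is small.

```python
import torch
import torch.nn as nn

def profile_block_deltas(blocks, latents_per_step):
    """Average relative change of each block's output between adjacent steps."""
    deltas = [[] for _ in blocks]
    prev_outs = [None] * len(blocks)
    for x in latents_per_step:                 # one latent per diffusion step
        h = x
        for i, blk in enumerate(blocks):
            h = blk(h)
            if prev_outs[i] is not None:
                rel = (h - prev_outs[i]).norm() / (prev_outs[i].norm() + 1e-8)
                deltas[i].append(rel.item())
            prev_outs[i] = h.detach()
    return [sum(d) / len(d) for d in deltas]

def cached_forward(blocks, x, cache, avg_deltas, thresh=0.05):
    """One denoising step that reuses cached residuals for stable blocks."""
    h = x
    for i, blk in enumerate(blocks):
        if cache[i] is not None and avg_deltas[i] < thresh:
            h = h + cache[i]                   # reuse the cached residual
        else:
            out = blk(h)
            cache[i] = (out - h).detach()      # refresh the cached residual
            h = out
    return h

# Toy usage: four linear "blocks" stand in for transformer blocks.
blocks = [nn.Linear(64, 64) for _ in range(4)]
with torch.no_grad():
    avg_deltas = profile_block_deltas(blocks, [torch.randn(1, 64) for _ in range(10)])
    cache = [None] * len(blocks)
    out = cached_forward(blocks, torch.randn(1, 64), cache, avg_deltas)
```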
## 📊 Quantitative Comparison

| Method | VBench ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | Latency (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|
| HunyuanVideo (720P, 129 frames) | 0.7703 | -- | -- | -- | -- | 1745 | -- |
| TeaCache (slow) | 0.7700 | 0.1720 | 21.91 | 0.7456 | 77.67 | 1052 | 1.66× |
| TeaCache (fast) | 0.7677 | 0.1830 | 21.60 | 0.7323 | 83.85 | 753 | 2.31× |
| Ours (HunyuanVideo) | 0.7642 | 0.1203 | 26.44 | 0.8445 | 41.10 | 932 | 1.87× |
| Method | VBench ↑ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | FID ↓ | Latency (s) ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|
| Wan2.1 (480P, 81 frames) | 0.7582 | -- | -- | -- | -- | 497 | -- |
| TeaCache (threshold 0.2) | 0.7604 | 0.2913 | 16.17 | 0.5685 | 117.61 | 249 | 2.00× |
| Ours (Wan2.1) | 0.7615 | 0.1256 | 22.02 | 0.7899 | 62.56 | 247 | 2.01× |
Tables: quantitative comparison with prior methods under the HunyuanVideo and Wan2.1 baselines.

🔺 Higher is better for VBench, PSNR, SSIM, and Speedup.
🔻 Lower is better for LPIPS, FID, and Latency.

Speedup is the baseline latency divided by the method's latency; for example, our HunyuanVideo result gives 1745 s / 932 s ≈ 1.87×.
## ⚡ Scale to Multi-GPU

Our method scales across multiple GPUs to accelerate both inference and training. By combining model parallelism, NCCL communication, and careful memory management, it achieves significant speedups without compromising quality:
- Increased Throughput 🚀: Distributes computation across multiple GPUs to process more frames in parallel.
- Optimized Memory Usage 🔧: Dynamically allocates memory to prevent bottlenecks.
- Flexible Deployment 💡: Works seamlessly on both single-node and distributed setups.
- NCCL Optimization 🔄: Uses efficient GPU-GPU communication to minimize overhead.
For detailed setup and configurations, please refer to our Multi-GPU Guide. 🚀
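For illustration, a multi-GPU launch might look like the sketch below. The flags are modeled on the upstream Wan2.1 multi-GPU example and are assumptions for this repository; consult the guide above for the exact options.

```bash
# Assumed launch pattern (8 GPUs), modeled on the upstream Wan2.1 example.
torchrun --nproc_per_node=8 generate.py \
    --task t2v-14B \
    --size 832*480 \
    --frame_num 81 \
    --dit_fsdp --t5_fsdp \
    --ulysses_size 8 \
    --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
    --delta_cache
```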
## 📝 To-Do List

- OpenSora2 🏗️ (upcoming support)
- Optimize caching for CogVideoX ⚙️
## 📖 Citation

If you find our work useful, please consider citing:

```bibtex
@misc{ma2025modelrevealscacheprofilingbased,
      title={Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models},
      author={Xuran Ma and Yexin Liu and Yaofu Liu and Xianfeng Wu and Mingzhe Zheng and Zihao Wang and Ser-Nam Lim and Harry Yang},
      year={2025},
      eprint={2504.03140},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.03140},
}
```
## 📄 License

This project is licensed under the Apache 2.0 License.