Luozhou Wang*, Yijun Li**, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Ying-Cong Chen†
HKUST(GZ), HKUST, Adobe Research.
* Internship Project
** Project Lead
† Corresponding Author
Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes.
We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixeler leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixeler preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data.
Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
- [2025.04.28] We have introduced a new development branch, `wan`, that integrates the Wan2.1 video generation model to support joint generation tasks. This branch includes training code tailored for generating both RGB and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.
- [2025.02.26] TransPixeler is accepted by CVPR 2025! See you in Nashville!
- [2025.01.19] We've renamed our project from TransPixar to TransPixeler!!
- [2025.01.17] We’ve created a Discord group and a WeChat group! Everyone is welcome to join for discussions and collaborations.
- [2025.01.14] Added new tasks to the repository's roadmap, including support for Hunyuan and LTX video models, and ComfyUI integration.
- [2025.01.07] Released project page, arXiv paper, inference code, and Hugging Face demo.
We have introduced a new development branch, `wan`, that integrates the Wan2.1 video generation model to support joint generation tasks.
In the `wan` branch, we have developed and released training code tailored for joint generation scenarios, enabling the simultaneous generation of RGB videos and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.
Key features of the `wan` branch:
- Integration of Wan2.1: Leverages the capabilities of the Wan2.1 video generation model for enhanced performance.
- Joint Generation Support: Facilitates the concurrent generation of RGB and paired modality videos.
- Dataset Structure: Expects each sample to include:
  - A primary video file (`001.mp4`) representing the RGB content.
  - A paired secondary video file (`001_seg.mp4`) with a fixed `_seg` suffix, representing the associated modality.
  - A caption text file (`001.txt`) with the same base name as the primary video.
- Periodic Evaluation: Supports periodic video sampling during training by setting `eval_every_step` or `eval_every_epoch` in the configuration.
- Customized Pipelines: Offers tailored training and inference pipelines designed specifically for joint generation tasks.
👉 To use the joint generation features, please check out the `wan` branch.
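Before training, it can help to confirm that every sample follows the dataset structure listed above. The following is a minimal sketch, not part of this repository: the helper name `find_incomplete_samples` is hypothetical, and it assumes all files of a sample sit together in one flat directory. It only checks that each primary video has its `_seg` video and caption; it does not inspect video contents.

```python
from pathlib import Path


def find_incomplete_samples(root, suffix="_seg"):
    """Map each primary video's base name to the companion files it is missing.

    Hypothetical helper: assumes a flat directory containing files such as
    001.mp4, 001_seg.mp4 and 001.txt, as in the dataset structure above.
    """
    root = Path(root)
    problems = {}
    for video in sorted(root.glob("*.mp4")):
        stem = video.stem
        if stem.endswith(suffix):
            continue  # secondary video; it is checked through its primary
        expected = (f"{stem}{suffix}.mp4", f"{stem}.txt")
        missing = [name for name in expected if not (root / name).is_file()]
        if missing:
            problems[stem] = missing
    return problems
```

Calling `find_incomplete_samples("/path/to/dataset")` returns an empty dictionary when every primary video has both companions. Secondary videos without a primary are not reported by this sketch.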
# For the main branch
conda create -n TransPixeler python=3.10
conda activate TransPixeler
pip install -r requirements.txt
Note: If you want to use the Wan2.1 model, please first check out the `wan` branch:
git checkout wan
Our pipeline is designed to support various video tasks, including Text-to-RGBA Video and Image-to-RGBA Video.
We provide the following pre-trained LoRA weights:
Task | Base Model | Frames | LoRA weights | Inference VRAM |
---|---|---|---|---|
T2V + RGBA | THUDM/CogVideoX-5B | 49 | link | ~24GB |
We have open-sourced the training code for Mochi on RGBA joint generation. Please refer to the Mochi README for details.
In addition to the Hugging Face online demo, users can also launch a local inference demo based on CogVideoX-5B by running the following command:
python app.py
To generate RGBA videos, navigate to the corresponding directory for the video model and execute the following command:
python cli.py \
--lora_path /path/to/lora \
--prompt "..."
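Once RGB and alpha frames are available, they are typically blended onto a background with the standard "over" operator. The sketch below is illustrative only and is not taken from this repository: the function name `composite_over` is hypothetical, it assumes straight (non-premultiplied) alpha, and it assumes the frames have already been loaded as float NumPy arrays in [0, 1]. How `cli.py` stores its output is not covered here.

```python
import numpy as np


def composite_over(rgb, alpha, background):
    """Blend a foreground frame over a background with the 'over' operator.

    rgb, background: float arrays of shape (H, W, 3) with values in [0, 1]
    alpha: float array of shape (H, W) in [0, 1]; 1 means fully opaque
    Assumes straight (non-premultiplied) alpha.
    """
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * rgb + (1.0 - a) * background
```

For a video, the same function can be applied frame by frame, or directly to arrays of shape (T, H, W, 3) and (T, H, W), since the broadcasting works the same way with a leading time axis.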
- finetrainers: We followed their implementation of Mochi training and inference.
- CogVideoX: We followed their implementation of CogVideoX training and inference.
We are grateful for their exceptional work and generous contribution to the open-source community.
@misc{wang2025transpixeler,
title={TransPixeler: Advancing Text-to-Video Generation with Transparency},
author={Luozhou Wang and Yijun Li and Zhifei Chen and Jui-Hsien Wang and Zhifei Zhang and He Zhang and Zhe Lin and Ying-Cong Chen},
year={2025},
eprint={2501.03006},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.03006},
}