[WIP] Add FlashVideo Text-to-Video Pipeline #10838
Conversation
Thanks for working on this @dg845! Great to see you active again! Happy to review once you think it's ready & help with anything 🤗
Hi @a-r-r-o-w, do you think it would be better to have a single consolidated pipeline for both FlashVideo Stage 1 and Stage 2 inference, or to have separate Stage 1 and Stage 2 pipelines? For context, the Stage 1 model generates low-resolution videos, whereas the Stage 2 model takes a Stage 1-generated low-res video as input and generates a high-res enhanced video from it. As far as I know, Stage 1 inference is essentially the same as CogVideoX1.0 inference (e.g. …).
Also @a-r-r-o-w, would it be possible to get GPU resources to help with testing the pipeline? My GPU (6 GB VRAM) runs out of memory when naively trying to perform inference with the original code, and I also get OOM errors when trying to load the larger Stage 1 checkpoint (based on CogVideoX1.0-5B) for e.g. … For the memory requirements, I'm mostly thinking about the memory necessary to generate a video using the official code as a reference for a …
> do you think it would be better to have a single consolidated pipeline for both FlashVideo Stage 1 and Stage 2 inference, or to have separate Stage 1 and Stage 2 pipelines?
Let's do two pipelines given that each stage can be run on its own and Stage 2 does not seem to need an input that comes from Stage 1 (from a quick look, but I may be wrong).
Let's also just create a new FlashVideoPipeline instead of reusing the CogVideoX one, for consistency, despite the large amount of replicated code.
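For illustration, a minimal sketch of what that split could look like; all class names, signatures, and defaults below are placeholders I'm assuming, not a settled API:

```python
from typing import List

import PIL.Image
import torch

from diffusers import DiffusionPipeline


class FlashVideoPipeline(DiffusionPipeline):
    """Stage 1 (placeholder sketch): text -> low-resolution video, CogVideoX-style sampling."""

    @torch.no_grad()
    def __call__(self, prompt: str, height: int = 480, width: int = 720, num_frames: int = 49):
        ...  # denoise low-res latents, then decode with the VAE


class FlashVideoStage2Pipeline(DiffusionPipeline):
    """Stage 2 (placeholder sketch): (prompt, low-res video) -> high-resolution video."""

    @torch.no_grad()
    def __call__(self, prompt: str, video: List[PIL.Image.Image], num_inference_steps: int = 4):
        ...  # encode the stage-1 frames, then enhance at high resolution
```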
> would it be possible to get GPU resources to help with testing the pipeline?
I think it should be possible for us to grant a GPU to awesome contributors like you, but I don't know the exact details/process. Will cc @apolinario @linoytsaban @yiyixuxu for help. Inference will probably need at least 40 GB with the basic memory optimizations, but during integration it's easier/faster to test with more memory available.
Thanks a lot for taking this up! If you'd like me to run any conversion, please LMK and I'll upload the required files.
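For reference on the memory side, these are the standard levers on today's CogVideoX pipelines (real diffusers APIs; that a FlashVideo pipeline built from the same components would expose the same hooks is my assumption):

```python
import torch

from diffusers import CogVideoXPipeline

# Standard memory optimizations available on CogVideoX-style pipelines today;
# a FlashVideo pipeline reusing the same components would presumably expose
# the same hooks (that part is an assumption).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keep only the active submodule on the GPU
pipe.vae.enable_tiling()         # decode the video latents tile by tile
pipe.vae.enable_slicing()        # decode the batch one slice at a time
```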
```python
@maybe_allow_in_graph
class CogVideoXBlock(nn.Module):
```
Let's rename this block to `FlashVideoTransformerBlock`.
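A sketch of the rename (the body would stay identical to the current `CogVideoXBlock`; shown only to illustrate the suggestion):

```python
@maybe_allow_in_graph
class FlashVideoTransformerBlock(nn.Module):
    # Same implementation as the current CogVideoXBlock, just renamed
    # for the new model.
    ...
```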
```python
@property
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
def attn_processors(self) -> Dict[str, AttentionProcessor]:
```
Let's remove the code for attention processors. Going forward, we will be adding support for this differently, and we want to keep the implementations as minimal and modeling-only as possible.
To clarify, should the `fuse_qkv_projections`/`unfuse_qkv_projections` methods also be removed? And if so, should the corresponding pipeline methods be removed as well?
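For context, these wrappers exist on `CogVideoXPipeline` today, so the question is whether a FlashVideo port should drop them (the snippet below assumes `pipe` is an already-loaded CogVideoX-style pipeline):

```python
# Assumes `pipe` is an already-loaded CogVideoX-style pipeline; these two
# pipeline-level methods are what the question above refers to.
pipe.fuse_qkv_projections()    # merge each attention block's q/k/v linear layers
video = pipe(prompt="a panda playing guitar").frames[0]
pipe.unfuse_qkv_projections()  # restore the separate q/k/v projections
```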
@dg845 Do you have an estimate of how long you'll need the GPUs for? @apolinario asked if you could create a HF Space. Once you do, an A100 can be allocated for the duration. Let us know if Spaces works for you or if you prefer something else.
Hi @a-r-r-o-w, sorry for the late reply. I've worked a little with HF Spaces before and think it would work for me. IIRC stuff on HF Spaces needs to go through a …
What does this PR do?
This PR adds a pipeline for the FlashVideo (paper, code, weights) text-to-video model. FlashVideo is a two-stage text-to-video model based on CogVideoX, consisting of a low-resolution video generation stage followed by a low-resolution-to-high-resolution upscaling stage.
The FlashVideo model was requested in #10767.
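For reviewers, a hypothetical end-to-end usage sketch of the two stages; the pipeline class names and checkpoint ids below are placeholders until the conversion is finalized, not a published API:

```python
import torch

from diffusers.utils import export_to_video

# Hypothetical usage; FlashVideoPipeline / FlashVideoStage2Pipeline and the
# checkpoint ids below are placeholders, not a published API.
prompt = "A cat playing with a ball of yarn"

stage1 = FlashVideoPipeline.from_pretrained(
    "FoundationVision/FlashVideo-stage1", torch_dtype=torch.bfloat16
).to("cuda")
low_res_frames = stage1(prompt=prompt).frames[0]

stage2 = FlashVideoStage2Pipeline.from_pretrained(
    "FoundationVision/FlashVideo-stage2", torch_dtype=torch.bfloat16
).to("cuda")
high_res_frames = stage2(prompt=prompt, video=low_res_frames).frames[0]

export_to_video(high_res_frames, "flashvideo.mp4", fps=8)
```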
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@a-r-r-o-w
@yiyixuxu
@ghunkins