
[WIP] Add FlashVideo Text-to-Video Pipeline #10838


Draft: dg845 wants to merge 9 commits into main

Conversation

dg845 (Contributor) commented Feb 20, 2025

What does this PR do?

This PR adds a pipeline for the FlashVideo (paper, code, weights) text-to-video model. FlashVideo is a two-stage text-to-video model based on CogVideoX, consisting of a low-resolution video generation stage followed by a low-resolution-to-high-resolution upscaling stage.

The FlashVideo model was requested in #10767.
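
A rough sketch of the intended two-stage usage (the pipeline class names, checkpoint paths, and call arguments below are provisional while the PR is WIP):

import torch
from diffusers import FlashVideoPipeline, FlashVideoVideoToVideoPipeline  # provisional names
from diffusers.utils import export_to_video

# Stage 1: low-resolution text-to-video generation (CogVideoX-style inference).
pipe_stage1 = FlashVideoPipeline.from_pretrained(
    "FoundationVision/FlashVideo", subfolder="stage1", torch_dtype=torch.bfloat16
).to("cuda")
prompt = "A cat playing a piano on stage"
low_res_video = pipe_stage1(prompt=prompt, num_frames=49).frames[0]

# Stage 2: enhance the Stage 1 output from low resolution to high resolution.
pipe_stage2 = FlashVideoVideoToVideoPipeline.from_pretrained(
    "FoundationVision/FlashVideo", subfolder="stage2", torch_dtype=torch.bfloat16
).to("cuda")
high_res_video = pipe_stage2(prompt=prompt, video=low_res_video).frames[0]
export_to_video(high_res_video, "flashvideo_output.mp4", fps=8)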

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@a-r-r-o-w
@yiyixuxu
@ghunkins

dg845 mentioned this pull request Feb 20, 2025
a-r-r-o-w (Member)

Thanks for working on this @dg845! Great to see you active again

Happy to review once you think it's ready & help with anything 🤗

dg845 (Contributor, Author) commented Apr 10, 2025

Hi @a-r-r-o-w, do you think it would be better to have a single consolidated pipeline for both FlashVideo Stage 1 and Stage 2 inference, or to have separate Stage 1 and Stage 2 pipelines?

For context, the Stage 1 model generates low-resolution videos, whereas the Stage 2 model takes as input a Stage 1-generated low-res video and generates a high-res enhanced video from it. As far as I know, Stage 1 inference is essentially the same as CogVideoX1.0 inference (e.g. CogVideoXPipeline), while Stage 2 inference would be similar to CogVideoXVideoToVideoPipeline (for example, in how the latents are prepared). So for separate pipelines we could have a FlashVideoPipeline for Stage 1 models and a FlashVideoVideoToVideoPipeline for Stage 2 models; in the current code FlashVideoPipeline is trying to handle both Stage 1 and Stage 2 inference. (In the two pipeline case we could also consider whether a separate FlashVideo Stage 1 pipeline is necessary; I think it might end up being exactly the same as CogVideoXPipeline.)
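
To make the Stage 2 similarity concrete, here is a minimal sketch of the latent preparation I have in mind, mirroring CogVideoXVideoToVideoPipeline.prepare_latents (the variable names are assumed, and FlashVideo's actual Stage 2 conditioning may differ):

# Encode the Stage 1 output video into latents and partially noise them,
# the same pattern CogVideoXVideoToVideoPipeline uses for its input video.
# randn_tensor is from diffusers.utils.torch_utils.
init_latents = self.vae.encode(video).latent_dist.sample(generator)
init_latents = init_latents * self.vae.config.scaling_factor
noise = randn_tensor(init_latents.shape, generator=generator, device=device, dtype=dtype)
latents = self.scheduler.add_noise(init_latents, noise, latent_timestep)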

dg845 (Contributor, Author) commented Apr 10, 2025

Also @a-r-r-o-w, would it be possible to get GPU resources to help with testing the pipeline? My GPU (6GB VRAM) runs out of memory when naively trying to perform inference with the original code, and I also get OOM errors when trying to load the larger Stage 1 checkpoint (based on CogVideoX1.0-5B), e.g. for diffusers checkpoint conversion.

For the memory requirements, I'm mostly thinking about the memory needed to generate a video with the official code, as a reference for a slow test checking whether the diffusers implementation and the original implementation are equivalent. It seems that at least 64 GB of memory is currently needed for Stage 2 inference (see FoundationVision/FlashVideo#15).

a-r-r-o-w (Member) left a comment


do you think it would be better to have a single consolidated pipeline for both FlashVideo Stage 1 and Stage 2 inference, or to have separate Stage 1 and Stage 2 pipelines?

Let's do two pipelines given that each stage can be run on its own and Stage 2 does not seem to need an input that comes from Stage 1 (from a quick look, but I may be wrong).

Let's also just create a new FlashVideoPipeline instead of reusing the CogVideoX one, for consistency, despite the large amount of replicated code.

would it be possible to get GPU resources to help with testing the pipeline?

I think it should be possible for us to grant a GPU to awesome contributors like you, but I don't know the exact details/process. Will cc @apolinario @linoytsaban @yiyixuxu for help. Inference will probably need at least 40 GB with the basic memory optimizations, but during integration it's easier/faster to test with more memory available.

Thanks a lot for taking this up! If you'd like me to run any conversions, please LMK and I'll upload the required files.
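
(For reference, the basic memory optimizations here would be the standard ones the CogVideoX pipelines already expose, assuming the FlashVideo pipelines keep the same VAE and offloading hooks:)

pipe.enable_model_cpu_offload()  # keep submodules on CPU, move each to GPU only when used
pipe.vae.enable_tiling()         # decode the video in spatial tiles to cap peak memory
pipe.vae.enable_slicing()        # decode the batch one sample at a time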



@maybe_allow_in_graph
class CogVideoXBlock(nn.Module):
a-r-r-o-w (Member)

Let's rename this block to FlashVideoTransformerBlock
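
i.e., something like the following (the docstring is just a placeholder; the block body would stay as-is):

@maybe_allow_in_graph
class FlashVideoTransformerBlock(nn.Module):
    r"""
    Transformer block used in FlashVideo, adapted from CogVideoXBlock.
    """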


@property
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
def attn_processors(self) -> Dict[str, AttentionProcessor]:
a-r-r-o-w (Member)

Let's remove the code for attention processors. Going forward, we will be adding support for this differently, and we want to keep the implementations as minimal and modeling-only as possible.

dg845 (Contributor, Author)

To clarify, should the fuse_qkv_projections/unfuse_qkv_projections methods also be removed? And if so, should the corresponding pipeline methods be removed as well?

a-r-r-o-w (Member)

@dg845 Do you have an estimate of how long you'll need the GPUs for? @apolinario asked if you could create an HF Space. Once you do, an A100 can be allocated for the duration. Let us know if Spaces works for you or if you prefer something else.

dg845 (Contributor, Author) commented Jul 21, 2025

Hi @a-r-r-o-w, sorry for the late reply. I've worked a little with HF Spaces before and think it would work for me. IIRC everything on HF Spaces needs to go through a Gradio app, so a normal cloud instance would probably be easier to work with, but I don't think this is a big deal. My estimate for how long I'd need the GPU is a week (though it may take longer). I also understand if this project is no longer a priority for resources.
