[WIP] Add FlashVideo Text-to-Video Pipeline #10838
Conversation
Thanks for working on this @dg845! Great to see you active again! Happy to review once you think it's ready & help with anything 🤗
Hi @a-r-r-o-w, do you think it would be better to have a single consolidated pipeline for both FlashVideo Stage 1 and Stage 2 inference, or to have separate Stage 1 and Stage 2 pipelines? For context, the Stage 1 model generates low-resolution videos, whereas the Stage 2 model takes a Stage 1-generated low-res video as input and generates a high-res enhanced video from it. As far as I know, Stage 1 inference is essentially the same as CogVideoX1.0 inference (e.g. …).
Also @a-r-r-o-w, would it be possible to get GPU resources to help with testing the pipeline? My GPU (6 GB VRAM) runs out of memory when naively trying to perform inference with the original code, and I also get OOM errors when trying to load the larger Stage 1 checkpoint (based on CogVideoX1.0-5B) for e.g. … For the memory requirements, I'm mostly thinking about the memory necessary to generate a video using the official code as a reference for a …
> do you think it would be better to have a single consolidated pipeline for both FlashVideo Stage 1 and Stage 2 inference, or to have separate Stage 1 and Stage 2 pipelines?
Let's do two pipelines given that each stage can be run on its own and Stage 2 does not seem to need an input that comes from Stage 1 (from a quick look, but I may be wrong).
Let's also just create a new FlashVideoPipeline instead of reusing the CogVideoX one, for consistency, despite the large amount of replicated code.
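For illustration, a minimal sketch of what that split could look like; all class names, signatures, and defaults below are placeholders I'm assuming, not a settled API:

```python
from typing import List

import PIL.Image
import torch

from diffusers import DiffusionPipeline


class FlashVideoPipeline(DiffusionPipeline):
    """Stage 1 (placeholder sketch): text -> low-resolution video, CogVideoX-style sampling."""

    @torch.no_grad()
    def __call__(self, prompt: str, height: int = 480, width: int = 720, num_frames: int = 49):
        ...  # denoise low-res latents, then decode with the VAE


class FlashVideoStage2Pipeline(DiffusionPipeline):
    """Stage 2 (placeholder sketch): (prompt, low-res video) -> high-resolution video."""

    @torch.no_grad()
    def __call__(self, prompt: str, video: List[PIL.Image.Image], num_inference_steps: int = 4):
        ...  # encode the stage-1 frames, then enhance at high resolution
```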
> would it be possible to get GPU resources to help with testing the pipeline?
I think it should be possible for us to grant a GPU to awesome contributors like you, but I don't know the exact details/process. Will cc @apolinario @linoytsaban @yiyixuxu for help. Inference will probably need at least 40 GB with the basic memory optimizations, but during integration it's easier/faster to test with more memory available.
Thanks a lot for taking this up! If you'd like me to run any conversion, please LMK and I'll upload the required files.
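For reference on the memory side, these are the standard levers on today's CogVideoX pipelines (real diffusers APIs; that a FlashVideo pipeline built from the same components would expose the same hooks is my assumption):

```python
import torch

from diffusers import CogVideoXPipeline

# Standard memory optimizations available on CogVideoX-style pipelines today;
# a FlashVideo pipeline reusing the same components would presumably expose
# the same hooks (that part is an assumption).
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # keep only the active submodule on the GPU
pipe.vae.enable_tiling()         # decode the video latents tile by tile
pipe.vae.enable_slicing()        # decode the batch one slice at a time
```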
```python
@maybe_allow_in_graph
class CogVideoXBlock(nn.Module):
```
Let's rename this block to `FlashVideoTransformerBlock`.
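A sketch of the rename (the body would stay identical to the current `CogVideoXBlock`; shown only to illustrate the suggestion):

```python
@maybe_allow_in_graph
class FlashVideoTransformerBlock(nn.Module):
    # Same implementation as the current CogVideoXBlock, just renamed
    # for the new model.
    ...
```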
```python
@property
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.attn_processors
def attn_processors(self) -> Dict[str, AttentionProcessor]:
```
Let's remove the code for attention processors. Going forward, we will be adding support for this differently, and we want to keep the implementations as minimal and modeling-only as possible.
To clarify, should the `fuse_qkv_projections`/`unfuse_qkv_projections` methods also be removed? And if so, should the corresponding pipeline methods be removed as well?
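For context, these wrappers exist on `CogVideoXPipeline` today, so the question is whether a FlashVideo port should drop them (the snippet below assumes `pipe` is an already-loaded CogVideoX-style pipeline):

```python
# Assumes `pipe` is an already-loaded CogVideoX-style pipeline; these two
# pipeline-level methods are what the question above refers to.
pipe.fuse_qkv_projections()    # merge each attention block's q/k/v linear layers
video = pipe(prompt="a panda playing guitar").frames[0]
pipe.unfuse_qkv_projections()  # restore the separate q/k/v projections
```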
@dg845 Do you have an estimate of how long you'll need the GPUs for? @apolinario asked if you could create a HF Space. Once you do, an A100 can be allocated for the duration. Let us know if Spaces works for you or if you prefer something else.
Hi @a-r-r-o-w, sorry for the late reply. I've worked a little with HF Spaces before and think it would work for me. IIRC stuff on HF Spaces needs to go through a …
What does this PR do?
This PR adds a pipeline for the FlashVideo (paper, code, weights) text-to-video model. FlashVideo is a two-stage text-to-video model based on CogVideoX, consisting of a low-resolution video generation stage followed by a low-resolution-to-high-resolution upscaling stage.
The FlashVideo model was requested in #10767.
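For reviewers, a hypothetical end-to-end usage sketch of the two stages; the pipeline class names and checkpoint ids below are placeholders until the conversion is finalized, not a published API:

```python
import torch

from diffusers.utils import export_to_video

# Hypothetical usage; FlashVideoPipeline / FlashVideoStage2Pipeline and the
# checkpoint ids below are placeholders, not a published API.
prompt = "A cat playing with a ball of yarn"

stage1 = FlashVideoPipeline.from_pretrained(
    "FoundationVision/FlashVideo-stage1", torch_dtype=torch.bfloat16
).to("cuda")
low_res_frames = stage1(prompt=prompt).frames[0]

stage2 = FlashVideoStage2Pipeline.from_pretrained(
    "FoundationVision/FlashVideo-stage2", torch_dtype=torch.bfloat16
).to("cuda")
high_res_frames = stage2(prompt=prompt, video=low_res_frames).frames[0]

export_to_video(high_res_frames, "flashvideo.mp4", fps=8)
```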
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@a-r-r-o-w
@yiyixuxu
@ghunkins