
Add SkyReels V2: Infinite-Length Film Generative Model #11518


Open
wants to merge 349 commits into base: main

Conversation

tolgacangoz
Contributor

@tolgacangoz tolgacangoz commented May 7, 2025

Thanks for the opportunity to fix #11374!

Original Work

Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074

SkyReels V2's main contributions are summarized as follows:
• A comprehensive video captioner that understands shot language while capturing the general description of the video, dramatically improving prompt adherence.
• Motion-specific preference optimization that enhances motion dynamics with a semi-automatic data collection pipeline.
• An effective diffusion-forcing adaptation that enables ultra-long video generation and story generation, providing a robust framework for extending temporal coherence and narrative depth.
• SkyCaptioner-V1 and the SkyReels-V2 series of models, including diffusion-forcing, text2video, image2video, camera-director, and elements2video models in various sizes (1.3B, 5B, 14B), are open-sourced.

(Figure: main pipeline architecture overview)

TODOs:
FlowMatchUniPCMultistepScheduler: just copy-pasted from the original repo
SkyReelsV2Transformer3DModel: 90% WanTransformer3DModel
SkyReelsV2DiffusionForcingPipeline
SkyReelsV2DiffusionForcingImageToVideoPipeline: Includes FLF2V.
SkyReelsV2DiffusionForcingVideoToVideoPipeline: Extends a given video.
SkyReelsV2Pipeline
SkyReelsV2ImageToVideoPipeline: Includes FLF2V.
scripts/convert_skyreelsv2_to_diffusers.py

⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.

T2V with Diffusion Forcing (OLD)

Skywork/SkyReels-V2-DF-1.3B-540P

seed 0, num_frames 97 | Original repo: original_0_short.mp4 | diffusers integration: diffusers_0_short.mp4
seed 37, num_frames 97 | Original repo: original_37_short.mp4 | diffusers integration: diffusers_37_short.mp4
seed 0, num_frames 257 | Original repo: original_0_long.mp4 | diffusers integration: diffusers_0_long.mp4
seed 37, num_frames 257 | Original repo: original_37_long.mp4 | diffusers integration: diffusers_37_long.mp4
!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipe(
    prompt=prompt,
    num_inference_steps=30,
    height=544,
    width=960,
    num_frames=97,
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "T2V.mp4", fps=24, quality=8)

"""
You can set `ar_step=5` to enable asynchronous inference. When asynchronous inference,
`causal_block_size=5` is recommended while it is not supposed to be set for
synchronous generation. Asynchronous inference will take more steps to diffuse the
whole sequence which means it will be SLOWER than synchronous mode. In our
experiments, asynchronous inference may improve the instruction following and visual consistent performance.
"""

I2V with Diffusion Forcing (OLD)

prompt="A penguin dances." diffusers integration
i2v-short.mp4
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

image = load_image("Penguin from https://huggingface.co/tasks/image-to-video")
prompt = "A penguin dances."

output = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "I2V.mp4", fps=24, quality=8)

"""
When I set `ar_step=5` and `causal_block_size=5`, the results seem really bad.
"""

FLF2V with Diffusion Forcing (OLD)

Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (torch.Size([1, 16, 1, 68, 120])) is overwritten onto the first of the 25 frame latents in latents (torch.Size([1, 16, 25, 68, 120])). Then the last frame's latent is concatenated, so latents becomes torch.Size([1, 16, 26, 68, 120]). After the denoising process, the extra last-frame latent is discarded and the result is decoded by the VAE. I also tried overwriting the last frame's latent onto the final frame of latents instead of concatenating it (and not discarding the final frame latent at the end), but still got bad results. Here are some results:

First Frame | Last Frame (input images)
Results: 0.mp4, 1.mp4, 2.mp4, 3.mp4, 4.mp4, 5.mp4, 6.mp4, 7.mp4
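
For clarity, a shape-only sketch of the latent handling described above (illustrative only, not the pipeline's actual code):

# Shape-only illustration of the FLF2V latent handling described above (not the pipeline's actual code).
import torch

latents = torch.randn(1, 16, 25, 68, 120)    # 25 frame latents for a 97-frame, 544x960 video
first = torch.randn(1, 16, 1, 68, 120)       # encoded first frame
last = torch.randn(1, 16, 1, 68, 120)        # encoded last frame

latents[:, :, :1] = first                    # overwrite the first frame latent
latents = torch.cat([latents, last], dim=2)  # concatenate the last frame latent -> (1, 16, 26, 68, 120)

# ... denoising runs over the 26 frame latents ...

latents = latents[:, :, :-1]                 # discard the extra last-frame latent -> (1, 16, 25, 68, 120)
# the remaining latents are then decoded by the VAE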
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

output = pipe(
    image=first_frame,
    last_image=last_frame,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "FLF2V.mp4", fps=24, quality=8)

V2V with Diffusion Forcing (OLD)

This pipeline extends a given video.

Input Video: video1.mp4 | diffusers integration: v2v.mp4
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird flaps its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its continuing flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
video = load_video("Input video.mp4")

output = pipe(
    video=video,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=120,
    base_num_frames=97,
    ar_step=0,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=17,  # Number of frames to overlap for smooth transitions in long videos
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "V2V.mp4", fps=24, quality=8)

Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris

@ukaprch

ukaprch commented May 8, 2025

It's about time. Thanks.

@tolgacangoz
Contributor Author

tolgacangoz commented May 14, 2025

Hi @yiyixuxu @a-r-r-o-w

Mid-PR questions:

  1. The issue was labelled as "contributions-welcome" but not as "community-examples". Also, the number of stars for this model has surpassed that of SkyReels-V1. Thus, I placed these pipelines in src/diffusers/pipelines/skyreels_v2/. Should I move these pipelines to examples/? Should I also split this PR into one PR per pipeline (group)?

  2. Just like SkyReels-V1 was based on HunyuanVideo, SkyReels-V2 is based on Wan, but some differences exist. I thought of moving the differences to the parent abstraction, i.e., the pipeline code, so that we could use WanTransformer3DModel for both, but it didn't seem appropriate enough to me at first. But then, if we introduce the diffusion forcing and autoregressive properties into WanTransformer3DModel (as natively as possible, not with the exact diff below), it seems possible to me. You can examine the current diff between transformer_wan.py and transformer_skyreels_v2.py: https://www.diffchecker.com/U72HJ6ox/ WDYT?

  3. Since SkyReels-V2 is not a completely new architecture, should I move its pipelines into src/diffusers/pipelines/wan/, similar to HunyuanSkyreelsImageToVideoPipeline, if SkyReels-V2 is seen as an official pipeline?

  4. I am removing TeaCache-related code because it is planned as a modular extension, right? If this PR is required to move to examples/, then there is no need to remove it, I think.

  5. I came across this:

    / (theta ** (torch.arange(0, dim, 2, dtype=freqs_dtype, device=pos.device)[: (dim // 2)] / dim))

    At first, [: (dim // 2)] confused me. Isn't it redundant? dim is already confirmed to be even by assert dim % 2 == 0, so torch.arange(0, dim, 2) already has exactly dim // 2 elements. Can I remove [: (dim // 2)] in a separate PR? (See the quick check below.)
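
A quick standalone check of the redundancy claimed in question 5 (my own sketch, not code from the PR):

# Quick check: for even dim, torch.arange(0, dim, 2) already has exactly dim // 2 elements,
# so the trailing [: (dim // 2)] slice is a no-op.
import torch

dim = 128
idx = torch.arange(0, dim, 2)
assert idx.shape[0] == dim // 2
assert torch.equal(idx, idx[: (dim // 2)])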

@a-r-r-o-w
Member

@tolgacangoz Thanks for working on this, really cool work so far!

  1. I think we should add the SkyReels models to core diffusers, so src/ is fine.

  2 and 3. I think in this case we should have separate implementations of SkyReelsV2 and Wan due to the autoregressive nature of the former. Adding any extra code to Wan might complicate it for readers. Will let @yiyixuxu comment on this though.

  4. Yeah, let's remove the cache code. We'll try to write a model-agnostic implementation in the future once more of the cache-related code is stabilized, after adding some of the easier methods that are not too model-intrusive (such as first block cache).

  5. You're right, it's redundant. Let's remove it in a separate PR.

@tolgacangoz changed the title from "Add SkyReels-V2 pipelines" to "Add SkyReels V2: Infinite-Length Film Generative Model" on May 16, 2025
@ukaprch

ukaprch commented May 16, 2025

FWIW, I have been successful in using the same T5 encoder as Wan 2.1 for this model just by fiddling with their pipeline:

        # Assumes QuantizedTransformersModel from optimum-quanto and UMT5EncoderModel from transformers are imported.
        print('Quantize text_encoder qint8')

        class QuantizedT5EncoderModelForCausalLM(QuantizedTransformersModel):
            auto_class = UMT5EncoderModel
            auto_class.from_config = auto_class._from_config

        text_encoder = QuantizedT5EncoderModelForCausalLM.from_pretrained(
            "./wan quantro T2V-14B-720P Diffusers/basemodel/t5encodermodel_qint8"
        ).to(dtype=dtype)

        pipe = Text2VideoPipeline(
            model_path=model_path,
            transformer=transformer,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            weight_dtype=dtype)

Then, in the pipeline's __init__, I incorporate my bitsandbytes NF4 transformer, their tokenizer, and the Wan-based T5 encoder:

def __init__(
    self, model_path, transformer, text_encoder, tokenizer, device: str = "cuda", weight_dtype=torch.bfloat16, use_usp=False, offload=False,
):
    self.transformer = transformer          #get_transformer(model_path, 'cpu', weight_dtype)
    vae_model_path = os.path.join(model_path, "Wan2.1_VAE.pth")
    self.vae = get_vae(vae_model_path, 'cpu', weight_dtype=torch.float32)
    if text_encoder is not None:
        self.text_encoder = text_encoder        #get_text_encoder(model_path, 'cuda', weight_dtype)
    if tokenizer is not None: 
        self.tokenizer = tokenizer

I need to add this function to the pipeline for the T5 encoder to work:

def encode(self, texts):
    ids, mask = self.tokenizer(texts, return_mask=True, add_special_tokens=True)
    ids = ids.to(self.device)
    mask = mask.to(self.device)
    context = self.text_encoder(ids, mask)
    #seq_lens = mask.gt(0).sum(dim=1).long()
    context = context.last_hidden_state * mask.unsqueeze(-1)
    return context

@tolgacangoz
Contributor Author

tolgacangoz commented May 19, 2025

That seems appropriate to me. Only the Diffusion Forcing pipelines differ for the large models. How are the results with your setup?

…correct dimensions for latent model input and noise application.
@tolgacangoz
Contributor Author

tolgacangoz commented May 23, 2025

Hi @yiyixuxu @a-r-r-o-w and SkyReels Team @yjp999 @pftq @Langdx @guibinchen ...

This PR will be ready for review for SkyReelsV2Transformer3DModel and SkyReelsV2DiffusionForcingPipeline soon. Other pipelines will follow quickly after initial feedback...

…reflect the actual length of the step matrix, ensuring accurate progress tracking during inference.
…ForcingPipeline` to improve robustness during inference. Added try-except blocks for better error reporting and streamlined tensor operations for noise application and latent updates.
…ionForcingPipeline` to enhance clarity and efficiency. Updated progress bar total to match the number of inference steps, ensuring accurate tracking. Streamlined the handling of latent model inputs and noise predictions for improved performance during inference.
…orcingPipeline` to improve clarity and maintainability. Updated prefix video latent length variables for consistency and corrected tensor slicing to ensure proper dimensions during processing.
…ading sharded model files and updating model configuration. Refactor model loading logic to accommodate new model types and ensure proper initialization of components such as the tokenizer and scheduler.
… support new model type `SkyReelsV2-DF-14B-540P`. Adjusted parameters including `in_channels`, `added_kv_proj_dim`, and `inject_sample_info`. Refactored sharded model loading logic to accommodate varying shard counts based on model type.
…to use `register_to_config` for setting configuration parameters. This change improves clarity and maintains consistency in model configuration handling.
…mask` based on configuration flag. This change enhances flexibility in model behavior during training and inference.
…ensure consistency and correct functionality.
Member

@a-r-r-o-w a-r-r-o-w left a comment

Thanks for the iterations @tolgacangoz! PR looks very close and I just have minor comments on my end. Off to @yiyixuxu

> [!TIP]
> Click on the SkyReels-V2 models in the right sidebar for more examples of video generation.

### A _Visual_ Demonstration
Member

Wow, this is great! Thank you! cc @stevhliu

super().__init__()

self.time_freq_dim = time_freq_dim
self.timesteps_proj = get_1d_sincos_pos_embed_from_grid
Member

What's the reason for this not to be the Timesteps layer (class Timesteps(nn.Module))? IIUC, the modeling structure is the same as Wan here, so I'm curious why it has to be this way.

If it must be this way, could we create a SkyReelsV2Timesteps class that calls get_1d_sincos_pos_embed_from_grid in its forward?

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

Initially, it was indeed Timesteps. However, I then realized that they were not quite the same. This is the calculation of sinusoidal timestep embeddings from the original repo:

def sinusoidal_embedding_1d(dim, position):
    # preprocess
    assert dim % 2 == 0
    half = dim // 2
    position = position.type(torch.float64)

    # calculation
    sinusoid = torch.outer(position, torch.pow(10000, -torch.arange(half).to(position).div(half)))
    x = torch.cat([torch.cos(sinusoid), torch.sin(sinusoid)], dim=1)
    return x

...

            e = self.time_embedding(
                sinusoidal_embedding_1d(self.freq_dim, t.flatten()).to(self.patch_embedding.weight.dtype)
            )  # b, dim
            e0 = self.time_projection(e).unflatten(1, (6, self.dim))  # b, 6, dim

...

Additionally, the check assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array" in get_timestep_embedding (used by Timesteps) doesn't comply with the diffusion forcing framework, where each latent frame has its own timestep.

Contributor Author

For proper serialization and module registration, I am creating SkyReelsV2Timesteps 👍. Adding a flag to Timesteps for choosing get_1d_sincos_pos_embed_from_grid would be inappropriate for diffusers' style, right?

Contributor Author

Out of curiosity, I took a look at the Wan repo and found that it uses the same sinusoidal_embedding_1d function. So, was the porting of Wan into diffusers not accurate enough?

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

Nope. It turns out that my porting was not proper 🥲.

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

It turns out I had failed to see this equality properly 🤦: e^(-ln(x)*a) = 1/(x^a).

Contributor Author

The only difference is precision (float64 vs. float32), but the error is torch.max(out1 - out2) -> tensor(2.5402e-07, dtype=torch.float64), thus not important.
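
A standalone sketch of this comparison (my own recomputation for illustration, not the PR's test code):

# Standalone check of the equivalence noted above: 10000 ** (-a / half) == exp(-ln(10000) * a / half),
# up to floating-point precision (float64 vs. float32).
import math
import torch

half = 128
position = torch.arange(16, dtype=torch.float64)

freqs64 = torch.pow(10000, -torch.arange(half, dtype=torch.float64) / half)             # original-repo style
freqs32 = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)  # Timesteps-style

out1 = torch.outer(position, freqs64)
out2 = torch.outer(position.float(), freqs32)
print(torch.max((out1 - out2.double()).abs()))  # on the order of 1e-7, i.e. negligible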

But there is still this check in get_timestep_embedding of Timesteps: assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array". We have different timesteps for each latent frame in the diffusion forcing framework, so what should we do here?

Member

We have different timesteps for each latent frame in the diffusion forcing framework, so what to do here?

I think you can copy-paste the function implementation in the SkyReels transformer file and modify as you see fit -- thanks for wrapping the logic in a layer!

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

I might be a bit confused: Isn't this what you suggested?

Comment on lines 300 to 302
elif temb.dim() == 4:
e = (self.scale_shift_table.unsqueeze(2) + temb.float()).chunk(6, dim=1)
shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = [ei.squeeze(1) for ei in e]
Member

I'm a bit unsure in which case we end up having a 4D temb. Could you point me to the LoC?

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

https://github.com/tolgacangoz/diffusers/blob/ee99fc0fc0bc56589400bd3daa381f82d1e7a882/src/diffusers/models/transformers/transformer_skyreels_v2.py#L509-L515

        if enable_diffusion_forcing:
            b, f = timestep.shape
            temb = temb.view(b, f, 1, 1, -1)                        # (b, f, 1, 1, inner_dim)
            timestep_proj = timestep_proj.view(b, f, 1, 1, 6, -1)   # (b, f, 1, 1, 6, inner_dim)
            temb = temb.repeat(1, 1, post_patch_height, post_patch_width, 1).flatten(1, 3)                       # (b, f*h*w, inner_dim)
            timestep_proj = timestep_proj.repeat(1, 1, post_patch_height, post_patch_width, 1, 1).flatten(1, 3)  # (b, f*h*w, 6, inner_dim)
            timestep_proj = timestep_proj.transpose(1, 2).contiguous()  # (b, 6, f*h*w, inner_dim) -> the 4D temb seen by the blocks

These are the lines right before the SkyReelsV2TransformerBlock.forward(hidden_states, encoder_hidden_states, timestep_proj, rotary_emb, causal_mask,).

Collaborator

possible to make a note on shape in the comments?

tolgacangoz and others added 9 commits June 25, 2025 08:54
Replaces `print()` calls with `logger.debug()` for reporting progress during long video generation in SkyReelsV2DF pipelines.

This change reduces console output verbosity for standard runs while allowing developers to view progress by enabling debug-level logging.
Extract the sinusoidal timestep embedding logic into a new `SkyReelsV2Timesteps` `nn.Module`.

This change encapsulates the embedding generation, which simplifies the `SkyReelsV2TimeTextImageEmbedding` class and improves code modularity.
Reshapes the timestep embedding tensor to match the original input shape.

This ensures that batched timestep inputs retain their batch dimension after embedding, preventing potential shape mismatches.
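
A minimal sketch of what such a wrapper could look like, based on the sinusoidal_embedding_1d function quoted earlier (illustrative only; names and details may differ from the PR's actual class):

import torch
import torch.nn as nn

class SkyReelsV2Timesteps(nn.Module):
    # Illustrative wrapper around the original sinusoidal_embedding_1d logic quoted earlier;
    # accepts per-frame timesteps of shape (batch, frames) as required by diffusion forcing.
    def __init__(self, num_channels: int):
        super().__init__()
        assert num_channels % 2 == 0
        self.num_channels = num_channels

    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
        original_shape = timesteps.shape
        position = timesteps.flatten().to(torch.float64)
        half = self.num_channels // 2
        freqs = torch.pow(10000, -torch.arange(half, dtype=torch.float64, device=position.device) / half)
        sinusoid = torch.outer(position, freqs)
        emb = torch.cat([torch.cos(sinusoid), torch.sin(sinusoid)], dim=1)
        # Restore the original (possibly batched) timestep shape, as described in the commit above.
        return emb.reshape(*original_shape, self.num_channels)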
@tolgacangoz tolgacangoz requested a review from a-r-r-o-w June 29, 2025 15:05
@a-r-r-o-w
Member

cc @yiyixuxu could you take a final look + scheduler?

Collaborator

@yiyixuxu yiyixuxu left a comment

Thanks @tolgacangoz! I left a few more comments. I'll ask the SkyReels team for a review, and I think we can merge this soon :)


_supports_gradient_checkpointing = True
_skip_layerwise_casting_patterns = ["patch_embedding", "condition_embedder", "norm"]
_no_split_modules = ["SkyReelsV2TransformerBlock"]
_keep_in_fp32_modules = ["time_embedder", "scale_shift_table", "norm1", "norm2", "norm3"]
Collaborator

@a-r-r-o-w if we are going to rely more on _keep_in_fp32_modules, I think we should start looking at the docs so that we don't recommend ever using to(torch_dtype=...) across the library.

Member

Will open a PR tomorrow adding a note
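
For illustration, a hedged sketch of the behavior under discussion (it assumes the converted checkpoint layout used elsewhere in this PR; the exact dtypes depend on the diffusers version and the final _keep_in_fp32_modules list):

import torch
from diffusers import SkyReelsV2Transformer3DModel  # added in this PR

# from_pretrained(torch_dtype=...) is expected to leave the _keep_in_fp32_modules
# entries (e.g. the time embedder) in float32 ...
transformer = SkyReelsV2Transformer3DModel.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
print({p.dtype for p in transformer.parameters()})  # expected: a mix of bfloat16 and float32

# ... whereas a blanket cast also downcasts those modules, defeating the protection,
# which is why recommending .to(torch_dtype=...) across the library is being reconsidered.
transformer.to(torch.bfloat16)
print({p.dtype for p in transformer.parameters()})  # everything bfloat16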

# When using multi-GPU inference via accelerate these will be on the
# first device rather than the last device, which hidden_states ends up
# on.
shift = shift.to(hidden_states.device)
Collaborator

Can we put this code in an ada layer norm and add that layer to _no_split_modules? We should not be handling devices for the multi-GPU use case.
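
A minimal sketch of that suggestion (hypothetical class name; as noted below, the actual refactor is deferred to a follow-up):

# Minimal sketch: keep the scale/shift parameters inside a small module so that listing it in
# _no_split_modules keeps them on the same device as hidden_states, removing the manual .to() call.
import torch
import torch.nn as nn

class SkyReelsV2AdaLayerNormOut(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.scale_shift_table = nn.Parameter(torch.randn(1, 2, dim) / dim**0.5)

    def forward(self, hidden_states: torch.Tensor, temb: torch.Tensor) -> torch.Tensor:
        # Parameters and inputs share this module's device, so no manual device handling is needed.
        shift, scale = (self.scale_shift_table + temb.unsqueeze(1)).chunk(2, dim=1)
        return self.norm(hidden_states) * (1 + scale) + shift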

Member

@a-r-r-o-w a-r-r-o-w Jul 1, 2025

This is copied from the Wan implementation. We can update this in a follow up, or update both together in this PR. Relevant issue: #10997

Collaborator

ohh follow up is fine then

@tolgacangoz
Contributor Author

Btw, the current usage of the scheduler (removing shift from the pipeline's call) produces slightly different timesteps and thus slightly different videos. I was investigating this but didn't reach a conclusion; I will take another look at it.

Also, before merging, should these two messages be investigated?

@a-r-r-o-w
Member

@tolgacangoz I think it'll be good to decouple FLF2V into a separate PR if the results are not good. I'm afraid I don't have the time to help investigate the cause right now, and this PR has been open for a really long time already and is anticipated by many to be merged. Let's try to merge the ones that work for now :)

tolgacangoz and others added 5 commits July 2, 2025 14:37
Colocates the `SkyReelsV2Timesteps` class with the SkyReelsV2 transformer model.

This change moves model-specific timestep embedding logic from the general embeddings module to the transformer's own file, improving modularity and making the model more self-contained.
Replaces manual parameter iteration with the `get_parameter_dtype` helper to determine the time embedder's data type.

This change improves code readability and centralizes the logic.
@tolgacangoz
Contributor Author

tolgacangoz commented Jul 2, 2025

Alternatively, I think they can stay as a placeholder or potential feature, because even with the original code I cannot produce good FLF2V results using the 1.3B models. Or maybe it was me who couldn't run this task properly, I don't know. It may be fine with larger models. I think this PR is well suited for its integration job.

Edit: I opened an issue in the original repo about this. I forgot to open it earlier, sorry 🥲.

@tolgacangoz tolgacangoz requested a review from yiyixuxu July 2, 2025 12:36
@yiyixuxu
Collaborator

yiyixuxu commented Jul 3, 2025

@tolgacangoz
are you able to refactor the current FlowMatchUniPCMultistepScheduler instead of adding a new one?

Labels: roadmap (Add to current release roadmap)
Projects: Status: In Progress
Development: Successfully merging this pull request may close these issues: [Feature request] Integrate SkyReels-V2 support in diffusers
7 participants