
Add SkyReels V2: Infinite-Length Film Generative Model #11518


Open
wants to merge 349 commits into base: main

Conversation

tolgacangoz
Contributor

@tolgacangoz tolgacangoz commented May 7, 2025

Thanks for the opportunity to fix #11374!

Original Work

Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074

SkyReels V2's main contributions are summarized as follows:
• A comprehensive video captioner that understands shot language while capturing the general description of the video, dramatically improving prompt adherence.
• Motion-specific preference optimization that enhances motion dynamics with a semi-automatic data collection pipeline.
• An effective diffusion-forcing adaptation that enables ultra-long video generation and story generation, providing a robust framework for extending temporal coherence and narrative depth.
• SkyCaptioner-V1 and the SkyReels-V2 series of models, including diffusion-forcing, text2video, image2video, camera-director, and elements2video models in various sizes (1.3B, 5B, 14B), are open-sourced.

(Figure: main pipeline architecture overview)

TODOs:
FlowMatchUniPCMultistepScheduler: just copy-pasted from the original repo
SkyReelsV2Transformer3DModel: 90% WanTransformer3DModel
SkyReelsV2DiffusionForcingPipeline
SkyReelsV2DiffusionForcingImageToVideoPipeline: Includes FLF2V.
SkyReelsV2DiffusionForcingVideoToVideoPipeline: Extends a given video.
SkyReelsV2Pipeline
SkyReelsV2ImageToVideoPipeline: Includes FLF2V.
scripts/convert_skyreelsv2_to_diffusers.py

⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.

T2V with Diffusion Forcing (OLD)

Skywork/SkyReels-V2-DF-1.3B-540P

seed 0, num_frames 97 | Original repo: original_0_short.mp4 | diffusers integration: diffusers_0_short.mp4
seed 37, num_frames 97 | Original repo: original_37_short.mp4 | diffusers integration: diffusers_37_short.mp4
seed 0, num_frames 257 | Original repo: original_0_long.mp4 | diffusers integration: diffusers_0_long.mp4
seed 37, num_frames 257 | Original repo: original_37_long.mp4 | diffusers integration: diffusers_37_long.mp4
!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingPipeline
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."

output = pipe(
    prompt=prompt,
    num_inference_steps=30,
    height=544,
    width=960,
    num_frames=97,
    ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "T2V.mp4", fps=24, quality=8)

"""
You can set `ar_step=5` to enable asynchronous inference. When asynchronous inference,
`causal_block_size=5` is recommended while it is not supposed to be set for
synchronous generation. Asynchronous inference will take more steps to diffuse the
whole sequence which means it will be SLOWER than synchronous mode. In our
experiments, asynchronous inference may improve the instruction following and visual consistent performance.
"""

I2V with Diffusion Forcing (OLD)

prompt="A penguin dances." diffusers integration
i2v-short.mp4
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

image = load_image("Penguin from https://huggingface.co/tasks/image-to-video")
prompt = "A penguin dances."

output = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "I2V.mp4", fps=24, quality=8)

"""
When I set `ar_step=5` and `causal_block_size=5`, the results seem really bad.
"""

FLF2V with Diffusion Forcing (OLD)

Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (torch.Size([1, 16, 1, 68, 120])) is overwritten onto the first of the 25 frame latents in latents (torch.Size([1, 16, 25, 68, 120])). Then the last frame's latent is concatenated, so latents becomes torch.Size([1, 16, 26, 68, 120]). After the denoising process, the extra last-frame latent is discarded and the result is decoded by the VAE. I also tried overwriting the last frame's latent onto the final frame of latents instead of concatenating it (and not discarding the final frame latent at the end), but still got bad results. Here are some results:

First Frame | Last Frame (input images)
Results: 0.mp4, 1.mp4, 2.mp4, 3.mp4, 4.mp4, 5.mp4, 6.mp4, 7.mp4
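
For clarity, a shape-only sketch of the latent handling described above (illustrative only, not the pipeline's actual code):

# Shape-only illustration of the FLF2V latent handling described above (not the pipeline's actual code).
import torch

latents = torch.randn(1, 16, 25, 68, 120)    # 25 frame latents for a 97-frame, 544x960 video
first = torch.randn(1, 16, 1, 68, 120)       # encoded first frame
last = torch.randn(1, 16, 1, 68, 120)        # encoded last frame

latents[:, :, :1] = first                    # overwrite the first frame latent
latents = torch.cat([latents, last], dim=2)  # concatenate the last frame latent -> (1, 16, 26, 68, 120)

# ... denoising runs over the 26 frame latents ...

latents = latents[:, :, :-1]                 # discard the extra last-frame latent -> (1, 16, 25, 68, 120)
# the remaining latents are then decoded by the VAE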
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingImageToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
first_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png")
last_frame = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png")

output = pipe(
    image=first_frame,
    last_image=last_frame,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=97,
    #ar_step=5,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=None,  # Number of frames to overlap for smooth transitions in long videos; 17 for long
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "FLF2V.mp4", fps=24, quality=8)

V2V with Diffusion Forcing (OLD)

This pipeline extends a given video.

Input Video: video1.mp4 | diffusers integration: v2v.mp4
#!pip uninstall diffusers -yq
#!pip install git+https://github.com/tolgacangoz/diffusers.git@skyreels-v2 ftfy -q
import torch
from diffusers import AutoencoderKLWan, SkyReelsV2DiffusionForcingVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

vae = AutoencoderKLWan.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    subfolder="vae",
    torch_dtype=torch.float32,
)
pipe = SkyReelsV2DiffusionForcingVideoToVideoPipeline.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")
#pipe.transformer.set_ar_attention(causal_block_size=5)

prompt = "CG animation style, a small blue bird flaps its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its continuing flight and the vastness of the sky from a close-up, low-angle perspective."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
video = load_video("Input video.mp4")

output = pipe(
    video=video,
    prompt=prompt,
    num_inference_steps=50,
    height=544,
    width=960,
    num_frames=120,
    base_num_frames=97,
    ar_step=0,  # Controls asynchronous inference (0 for synchronous mode)
    generator=torch.Generator(device="cpu").manual_seed(0),
    overlap_history=17,  # Number of frames to overlap for smooth transitions in long videos
    addnoise_condition=20,  # Improves consistency in long video generation
).frames[0]
export_to_video(output, "V2V.mp4", fps=24, quality=8)

Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris

@ukaprch

ukaprch commented May 8, 2025

It's about time. Thanks.

@tolgacangoz
Contributor Author

tolgacangoz commented May 14, 2025

Hi @yiyixuxu @a-r-r-o-w

Mid-PR questions:

  1. The issue was labelled as "contributions-welcome" but not as "community-examples". Also, the number of stars for this model has surpassed that of SkyReels-V1. Thus, I placed these pipelines in src/diffusers/pipelines/skyreels_v2/. Should I move these pipelines to examples/? Should I also split this PR into one PR per pipeline (group)?

  2. Just like SkyReels-V1 was based on HunyuanVideo, SkyReels-V2 is based on Wan, but some differences exist. I thought of moving the differences to the parent abstraction, i.e., the pipeline code, so that we could use WanTransformer3DModel for both, but it didn't seem appropriate enough to me at first. But then, if we introduce the diffusion forcing and autoregressive properties into WanTransformer3DModel (as natively as possible, not with the exact diff below), it seems possible to me. You can examine the current diff between transformer_wan.py and transformer_skyreels_v2.py: https://www.diffchecker.com/U72HJ6ox/ WDYT?

  3. Since SkyReels-V2 is not a completely new architecture, should I move its pipelines into src/diffusers/pipelines/wan/, similar to HunyuanSkyreelsImageToVideoPipeline, if SkyReels-V2 is seen as an official pipeline?

  4. I am removing TeaCache-related code because it is planned as a modular extension, right? If this PR is required to move to examples/, then there is no need to remove it, I think.

  5. I came across this:

    / (theta ** (torch.arange(0, dim, 2, dtype=freqs_dtype, device=pos.device)[: (dim // 2)] / dim))

    At first, [: (dim // 2)] confused me. Isn't it redundant? dim is already confirmed to be even by assert dim % 2 == 0, so torch.arange(0, dim, 2) already has exactly dim // 2 elements. Can I remove [: (dim // 2)] in a separate PR? (See the quick check below.)
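
A quick standalone check of the redundancy claimed in question 5 (my own sketch, not code from the PR):

# Quick check: for even dim, torch.arange(0, dim, 2) already has exactly dim // 2 elements,
# so the trailing [: (dim // 2)] slice is a no-op.
import torch

dim = 128
idx = torch.arange(0, dim, 2)
assert idx.shape[0] == dim // 2
assert torch.equal(idx, idx[: (dim // 2)])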

@a-r-r-o-w
Member

@tolgacangoz Thanks for working on this, really cool work so far!

  1. I think we should add the SkyReels models to core diffusers, so src/ is fine.

  2 and 3. I think in this case we should have separate implementations of SkyReelsV2 and Wan due to the autoregressive nature of the former. Adding any extra code to Wan might complicate it for readers. Will let @yiyixuxu comment on this though.

  4. Yeah, let's remove the cache code. We'll try to write a model-agnostic implementation in the future once more of the cache-related code is stabilized, after adding some of the easier methods that are not too model-intrusive (such as first block cache).

  5. You're right, it's redundant. Let's remove it in a separate PR.

@tolgacangoz changed the title from "Add SkyReels-V2 pipelines" to "Add SkyReels V2: Infinite-Length Film Generative Model" on May 16, 2025
@ukaprch

ukaprch commented May 16, 2025

FWIW, I have been successful in using the same T5 encoder as Wan 2.1 for this model just by fiddling with their pipeline:

        # Assumes QuantizedTransformersModel from optimum-quanto and UMT5EncoderModel from transformers are imported.
        print('Quantize text_encoder qint8')

        class QuantizedT5EncoderModelForCausalLM(QuantizedTransformersModel):
            auto_class = UMT5EncoderModel
            auto_class.from_config = auto_class._from_config

        text_encoder = QuantizedT5EncoderModelForCausalLM.from_pretrained(
            "./wan quantro T2V-14B-720P Diffusers/basemodel/t5encodermodel_qint8"
        ).to(dtype=dtype)

        pipe = Text2VideoPipeline(
            model_path=model_path,
            transformer=transformer,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            weight_dtype=dtype)

Then, in the pipeline's __init__, I incorporate my bitsandbytes NF4 transformer, their tokenizer, and the Wan-based T5 encoder:

def __init__(
    self, model_path, transformer, text_encoder, tokenizer, device: str = "cuda", weight_dtype=torch.bfloat16, use_usp=False, offload=False,
):
    self.transformer = transformer          #get_transformer(model_path, 'cpu', weight_dtype)
    vae_model_path = os.path.join(model_path, "Wan2.1_VAE.pth")
    self.vae = get_vae(vae_model_path, 'cpu', weight_dtype=torch.float32)
    if text_encoder is not None:
        self.text_encoder = text_encoder        #get_text_encoder(model_path, 'cuda', weight_dtype)
    if tokenizer is not None: 
        self.tokenizer = tokenizer

I need to add this function to the pipeline for the T5 encoder to work:

def encode(self, texts):
    ids, mask = self.tokenizer(texts, return_mask=True, add_special_tokens=True)
    ids = ids.to(self.device)
    mask = mask.to(self.device)
    context = self.text_encoder(ids, mask)
    #seq_lens = mask.gt(0).sum(dim=1).long()
    context = context.last_hidden_state * mask.unsqueeze(-1)
    return context

@tolgacangoz
Contributor Author

tolgacangoz commented May 19, 2025

That seems appropriate to me. Only the Diffusion Forcing pipelines differ for the large models. How are the results with your setup?

…correct dimensions for latent model input and noise application.
@tolgacangoz
Contributor Author

tolgacangoz commented May 23, 2025

Hi @yiyixuxu @a-r-r-o-w and SkyReels Team @yjp999 @pftq @Langdx @guibinchen ...

This PR will be ready for review for SkyReelsV2Transformer3DModel and SkyReelsV2DiffusionForcingPipeline soon. Other pipelines will follow quickly after initial feedback...

…reflect the actual length of the step matrix, ensuring accurate progress tracking during inference.
…ForcingPipeline` to improve robustness during inference. Added try-except blocks for better error reporting and streamlined tensor operations for noise application and latent updates.
…ionForcingPipeline` to enhance clarity and efficiency. Updated progress bar total to match the number of inference steps, ensuring accurate tracking. Streamlined the handling of latent model inputs and noise predictions for improved performance during inference.
…orcingPipeline` to improve clarity and maintainability. Updated prefix video latent length variables for consistency and corrected tensor slicing to ensure proper dimensions during processing.
…ading sharded model files and updating model configuration. Refactor model loading logic to accommodate new model types and ensure proper initialization of components such as the tokenizer and scheduler.
… support new model type `SkyReelsV2-DF-14B-540P`. Adjusted parameters including `in_channels`, `added_kv_proj_dim`, and `inject_sample_info`. Refactored sharded model loading logic to accommodate varying shard counts based on model type.
…to use `register_to_config` for setting configuration parameters. This change improves clarity and maintains consistency in model configuration handling.
…mask` based on configuration flag. This change enhances flexibility in model behavior during training and inference.
…ensure consistency and correct functionality.
Member

@a-r-r-o-w a-r-r-o-w left a comment

Thanks for the iterations @tolgacangoz! PR looks very close and I just have minor comments on my end. Off to @yiyixuxu

> [!TIP]
> Click on the SkyReels-V2 models in the right sidebar for more examples of video generation.

### A _Visual_ Demonstration
Member

Wow, this is great! Thank you! cc @stevhliu

super().__init__()

self.time_freq_dim = time_freq_dim
self.timesteps_proj = get_1d_sincos_pos_embed_from_grid
Member

What's the reason for this not to be the Timesteps layer (class Timesteps(nn.Module))? IIUC, the modeling structure is the same as Wan here, so I'm curious why it has to be this way.

If it must be this way, could we create a SkyReelsV2Timesteps class that calls get_1d_sincos_pos_embed_from_grid in its forward?

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

Initially, it was indeed Timesteps. However, I then realized that they were not quite the same. This is the calculation of sinusoidal timestep embeddings from the original repo:

def sinusoidal_embedding_1d(dim, position):
    # preprocess
    assert dim % 2 == 0
    half = dim // 2
    position = position.type(torch.float64)

    # calculation
    sinusoid = torch.outer(position, torch.pow(10000, -torch.arange(half).to(position).div(half)))
    x = torch.cat([torch.cos(sinusoid), torch.sin(sinusoid)], dim=1)
    return x

...

            e = self.time_embedding(
                sinusoidal_embedding_1d(self.freq_dim, t.flatten()).to(self.patch_embedding.weight.dtype)
            )  # b, dim
            e0 = self.time_projection(e).unflatten(1, (6, self.dim))  # b, 6, dim

...

Additionally, the check assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array" in get_timestep_embedding (used by Timesteps) doesn't comply with the diffusion forcing framework, where each latent frame has its own timestep.

Contributor Author

For proper serialization and module registration, I am creating SkyReelsV2Timesteps 👍. Adding a flag to Timesteps for choosing get_1d_sincos_pos_embed_from_grid would be inappropriate for diffusers' style, right?

Contributor Author

Out of curiosity, I took a look at the Wan repo and found that it uses the same sinusoidal_embedding_1d function. So, was the porting of Wan into diffusers not accurate enough?

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

Nope. It turns out that my porting was not proper 🥲.

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

It turns out I had failed to see this equality properly 🤦: e^(-ln(x)*a) = 1/(x^a).

Contributor Author

The only difference is precision (float64 vs. float32), but the error is torch.max(out1 - out2) -> tensor(2.5402e-07, dtype=torch.float64), thus not important.
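
A standalone sketch of this comparison (my own recomputation for illustration, not the PR's test code):

# Standalone check of the equivalence noted above: 10000 ** (-a / half) == exp(-ln(10000) * a / half),
# up to floating-point precision (float64 vs. float32).
import math
import torch

half = 128
position = torch.arange(16, dtype=torch.float64)

freqs64 = torch.pow(10000, -torch.arange(half, dtype=torch.float64) / half)             # original-repo style
freqs32 = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)  # Timesteps-style

out1 = torch.outer(position, freqs64)
out2 = torch.outer(position.float(), freqs32)
print(torch.max((out1 - out2.double()).abs()))  # on the order of 1e-7, i.e. negligible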

But there is still this check in get_timestep_embedding of Timesteps: assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array". We have different timesteps for each latent frame in the diffusion forcing framework, so what should we do here?

Member

We have different timesteps for each latent frame in the diffusion forcing framework, so what to do here?

I think you can copy-paste the function implementation in the SkyReels transformer file and modify as you see fit -- thanks for wrapping the logic in a layer!

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

I might be a bit confused: Isn't this what you suggested?

Comment on lines 300 to 302
elif temb.dim() == 4:
e = (self.scale_shift_table.unsqueeze(2) + temb.float()).chunk(6, dim=1)
shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = [ei.squeeze(1) for ei in e]
Member

I'm a bit unsure in which case we end up having a 4D temb. Could you point me to the LoC?

Contributor Author

@tolgacangoz tolgacangoz Jun 25, 2025

https://github.com/tolgacangoz/diffusers/blob/ee99fc0fc0bc56589400bd3daa381f82d1e7a882/src/diffusers/models/transformers/transformer_skyreels_v2.py#L509-L515

        if enable_diffusion_forcing:
            b, f = timestep.shape
            temb = temb.view(b, f, 1, 1, -1)                        # (b, f, 1, 1, inner_dim)
            timestep_proj = timestep_proj.view(b, f, 1, 1, 6, -1)   # (b, f, 1, 1, 6, inner_dim)
            temb = temb.repeat(1, 1, post_patch_height, post_patch_width, 1).flatten(1, 3)                       # (b, f*h*w, inner_dim)
            timestep_proj = timestep_proj.repeat(1, 1, post_patch_height, post_patch_width, 1, 1).flatten(1, 3)  # (b, f*h*w, 6, inner_dim)
            timestep_proj = timestep_proj.transpose(1, 2).contiguous()  # (b, 6, f*h*w, inner_dim) -> the 4D temb seen by the blocks

These are the lines right before the SkyReelsV2TransformerBlock.forward(hidden_states, encoder_hidden_states, timestep_proj, rotary_emb, causal_mask,).

Collaborator

possible to make a note on shape in the comments?

tolgacangoz and others added 9 commits June 25, 2025 08:54
Replaces `print()` calls with `logger.debug()` for reporting progress during long video generation in SkyReelsV2DF pipelines.

This change reduces console output verbosity for standard runs while allowing developers to view progress by enabling debug-level logging.
Extract the sinusoidal timestep embedding logic into a new `SkyReelsV2Timesteps` `nn.Module`.

This change encapsulates the embedding generation, which simplifies the `SkyReelsV2TimeTextImageEmbedding` class and improves code modularity.
Reshapes the timestep embedding tensor to match the original input shape.

This ensures that batched timestep inputs retain their batch dimension after embedding, preventing potential shape mismatches.
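
A minimal sketch of what such a wrapper could look like, based on the sinusoidal_embedding_1d function quoted earlier (illustrative only; names and details may differ from the PR's actual class):

import torch
import torch.nn as nn

class SkyReelsV2Timesteps(nn.Module):
    # Illustrative wrapper around the original sinusoidal_embedding_1d logic quoted earlier;
    # accepts per-frame timesteps of shape (batch, frames) as required by diffusion forcing.
    def __init__(self, num_channels: int):
        super().__init__()
        assert num_channels % 2 == 0
        self.num_channels = num_channels

    def forward(self, timesteps: torch.Tensor) -> torch.Tensor:
        original_shape = timesteps.shape
        position = timesteps.flatten().to(torch.float64)
        half = self.num_channels // 2
        freqs = torch.pow(10000, -torch.arange(half, dtype=torch.float64, device=position.device) / half)
        sinusoid = torch.outer(position, freqs)
        emb = torch.cat([torch.cos(sinusoid), torch.sin(sinusoid)], dim=1)
        # Restore the original (possibly batched) timestep shape, as described in the commit above.
        return emb.reshape(*original_shape, self.num_channels)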
@tolgacangoz tolgacangoz requested a review from a-r-r-o-w June 29, 2025 15:05
@a-r-r-o-w
Member

cc @yiyixuxu could you take a final look + scheduler?

Collaborator

@yiyixuxu yiyixuxu left a comment

Thanks @tolgacangoz! I left a few more comments. I'll ask the SkyReels team for a review, and I think we can merge this soon :)


_supports_gradient_checkpointing = True
_skip_layerwise_casting_patterns = ["patch_embedding", "condition_embedder", "norm"]
_no_split_modules = ["SkyReelsV2TransformerBlock"]
_keep_in_fp32_modules = ["time_embedder", "scale_shift_table", "norm1", "norm2", "norm3"]
Collaborator

@a-r-r-o-w if we are going to rely more on _keep_in_fp32_modules, I think we should start looking at the docs so that we don't recommend ever using to(torch_dtype=...) across the library.

Member

Will open a PR tomorrow adding a note
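
For illustration, a hedged sketch of the behavior under discussion (it assumes the converted checkpoint layout used elsewhere in this PR; the exact dtypes depend on the diffusers version and the final _keep_in_fp32_modules list):

import torch
from diffusers import SkyReelsV2Transformer3DModel  # added in this PR

# from_pretrained(torch_dtype=...) is expected to leave the _keep_in_fp32_modules
# entries (e.g. the time embedder) in float32 ...
transformer = SkyReelsV2Transformer3DModel.from_pretrained(
    "tolgacangoz/SkyReels-V2-DF-1.3B-540P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
print({p.dtype for p in transformer.parameters()})  # expected: a mix of bfloat16 and float32

# ... whereas a blanket cast also downcasts those modules, defeating the protection,
# which is why recommending .to(torch_dtype=...) across the library is being reconsidered.
transformer.to(torch.bfloat16)
print({p.dtype for p in transformer.parameters()})  # everything bfloat16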

# When using multi-GPU inference via accelerate these will be on the
# first device rather than the last device, which hidden_states ends up
# on.
shift = shift.to(hidden_states.device)
Collaborator

Can we put this code in an ada layer norm and add that layer to _no_split_modules? We should not be handling devices for the multi-GPU use case.
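
A minimal sketch of that suggestion (hypothetical class name; as noted below, the actual refactor is deferred to a follow-up):

# Minimal sketch: keep the scale/shift parameters inside a small module so that listing it in
# _no_split_modules keeps them on the same device as hidden_states, removing the manual .to() call.
import torch
import torch.nn as nn

class SkyReelsV2AdaLayerNormOut(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False, eps=1e-6)
        self.scale_shift_table = nn.Parameter(torch.randn(1, 2, dim) / dim**0.5)

    def forward(self, hidden_states: torch.Tensor, temb: torch.Tensor) -> torch.Tensor:
        # Parameters and inputs share this module's device, so no manual device handling is needed.
        shift, scale = (self.scale_shift_table + temb.unsqueeze(1)).chunk(2, dim=1)
        return self.norm(hidden_states) * (1 + scale) + shift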

Member

@a-r-r-o-w a-r-r-o-w Jul 1, 2025

This is copied from the Wan implementation. We can update this in a follow up, or update both together in this PR. Relevant issue: #10997

Collaborator

ohh follow up is fine then

@tolgacangoz
Contributor Author

Btw, the current usage of the scheduler (removing shift from the pipeline's call) produces slightly different timesteps and thus slightly different videos. I was investigating this but didn't reach a conclusion; I will take another look at it.

Also, before merging, should these two messages be investigated?

@a-r-r-o-w
Member

@tolgacangoz I think it'll be good to decouple FLF2V into a separate PR if the results are not good. I'm afraid I don't have the time to help investigate the cause right now, and this PR has been open for a really long time already and is anticipated by many to be merged. Let's try to merge the ones that work for now :)

tolgacangoz and others added 5 commits July 2, 2025 14:37
Colocates the `SkyReelsV2Timesteps` class with the SkyReelsV2 transformer model.

This change moves model-specific timestep embedding logic from the general embeddings module to the transformer's own file, improving modularity and making the model more self-contained.
Replaces manual parameter iteration with the `get_parameter_dtype` helper to determine the time embedder's data type.

This change improves code readability and centralizes the logic.
@tolgacangoz
Contributor Author

tolgacangoz commented Jul 2, 2025

Alternatively, I think they can stay as a placeholder or potential feature, because even with the original code I cannot produce good FLF2V results using the 1.3B models. Or maybe it was me who couldn't run this task properly, I don't know. It may be fine with larger models. I think this PR is well suited for its integration job.

Edit: I opened an issue in the original repo about this. I forgot to open it earlier, sorry 🥲.

@tolgacangoz tolgacangoz requested a review from yiyixuxu July 2, 2025 12:36
@yiyixuxu
Collaborator

yiyixuxu commented Jul 3, 2025

@tolgacangoz
are you able to refactor the current FlowMatchUniPCMultistepScheduler instead of adding a new one?

Labels: roadmap (Add to current release roadmap)
Projects: Status: In Progress
Development: Successfully merging this pull request may close these issues: [Feature request] Integrate SkyReels-V2 support in diffusers
7 participants