Add SkyReels V2: Infinite-Length Film Generative Model #11518
It's about time. Thanks.
Mid-PR questions:
@tolgacangoz Thanks for working on this, really cool work so far!
2 and 3. I think in this case, we should have separate implementations of SkyReelsV2 and Wan due to the autoregressive nature of the former. Adding any extra code in Wan might complicate it for readers. Will let @yiyixuxu comment on this though.
FWIW, I have been successful in using the same T5 encoder for WAN 2.1 for this model just by fiddling with their pipeline:
Then this: I incorporate my bitsandbytes nf4 transformer, their tokenizer and the WAN based T5 encoder:
I need to add this function to the pipeline for the T5 encoder to work:
It seems appropriate to me. Only Diffusion Forcing pipelines are different for large models. How are the results with your setting?
…correct dimensions for latent model input and noise application.
Hi @yiyixuxu @a-r-r-o-w and SkyReels Team @yjp999 @pftq @Langdx @guibinchen ... This PR will be ready for review for
…reflect the actual length of the step matrix, ensuring accurate progress tracking during inference.
…ForcingPipeline` to improve robustness during inference. Added try-except blocks for better error reporting and streamlined tensor operations for noise application and latent updates.
…ionForcingPipeline` to enhance clarity and efficiency. Updated progress bar total to match the number of inference steps, ensuring accurate tracking. Streamlined the handling of latent model inputs and noise predictions for improved performance during inference.
…orcingPipeline` to improve clarity and maintainability. Updated prefix video latent length variables for consistency and corrected tensor slicing to ensure proper dimensions during processing.
… consecutive frames
…ading sharded model files and updating model configuration. Refactor model loading logic to accommodate new model types and ensure proper initialization of components such as the tokenizer and scheduler.
… support new model type `SkyReelsV2-DF-14B-540P`. Adjusted parameters including `in_channels`, `added_kv_proj_dim`, and `inject_sample_info`. Refactored sharded model loading logic to accommodate varying shard counts based on model type.
…to use `register_to_config` for setting configuration parameters. This change improves clarity and maintains consistency in model configuration handling.
…mask` based on configuration flag. This change enhances flexibility in model behavior during training and inference.
…ensure consistency and correct functionality.
Thanks for the iterations @tolgacangoz! PR looks very close and I just have minor comments on my end. Off to @yiyixuxu
> [!TIP]
> Click on the SkyReels-V2 models in the right sidebar for more examples of video generation.

### A _Visual_ Demonstration
Wow, this is great! Thank you! cc @stevhliu
super().__init__()

self.time_freq_dim = time_freq_dim
self.timesteps_proj = get_1d_sincos_pos_embed_from_grid
What's the reason for this not being the `Timesteps` layer from diffusers/src/diffusers/models/embeddings.py (line 1320 in d3e27e0, `class Timesteps(nn.Module):`)?
If it must be this way, could we create a `SkyReelsV2Timesteps` class that calls `get_1d_sincos_pos_embed_from_grid` in its forward?
Initially, it was indeed `Timesteps`. However, I then realized that they were not quite the same. This is the calculation of sinusoidal timestep embeddings from the original repo:
def sinusoidal_embedding_1d(dim, position):
    # preprocess
    assert dim % 2 == 0
    half = dim // 2
    position = position.type(torch.float64)
    # calculation
    sinusoid = torch.outer(position, torch.pow(10000, -torch.arange(half).to(position).div(half)))
    x = torch.cat([torch.cos(sinusoid), torch.sin(sinusoid)], dim=1)
    return x
...
e = self.time_embedding(
sinusoidal_embedding_1d(self.freq_dim, t.flatten()).to(self.patch_embedding.weight.dtype)
) # b, dim
e0 = self.time_projection(e).unflatten(1, (6, self.dim)) # b, 6, dim
...
Additionally, the check `assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"` in `get_timestep_embedding` of `Timesteps` doesn't comply with the diffusion forcing framework.
For proper serialization and module registration, I am creating `SkyReelsV2Timesteps` 👍. Adding a flag for choosing `get_1d_sincos_pos_embed_from_grid` in `Timesteps` would not fit the `diffusers` style, right?
I wondered and took a look at the Wan repo and found that it uses the same `sinusoidal_embedding_1d` function. Thus, was the porting of Wan into `diffusers` not accurate enough?
Nope. It turns out that my porting was not proper 🥲.
I just couldn't see this equality properly 🤦: `e^(-ln(x)*a) = 1/(x^a)`.
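For readers following the thread, the identity can be checked numerically. The sketch below is a plain-Python illustration (not code from this PR or from `diffusers`; the function names are hypothetical). It compares the frequency term written as a power, as in the original repo's `torch.pow(10000, -i / half)`, with the same term written using `exp` and `log`:

```python
import math


def inv_freq_pow(half, base=10000.0):
    """Frequencies written as base ** (-i / half)."""
    return [base ** (-i / half) for i in range(half)]


def inv_freq_exp(half, base=10000.0):
    """The same frequencies written as exp(-ln(base) * i / half)."""
    return [math.exp(-math.log(base) * i / half) for i in range(half)]


half = 128  # hypothetical half embedding width
max_abs_diff = max(
    abs(p - e) for p, e in zip(inv_freq_pow(half), inv_freq_exp(half))
)
print(max_abs_diff)  # only floating-point rounding noise remains
```

Both forms describe the same frequencies, so any remaining difference between two implementations has to come from precision or layout choices rather than from the formula.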
The only difference is precision (float64 vs. float32), but the error is `torch.max(out1 - out2) -> tensor(2.5402e-07, dtype=torch.float64)`, so it is not important.
But there is still this verification in `get_timestep_embedding` of `Timesteps`: `assert len(timesteps.shape) == 1, "Timesteps should be a 1d-array"`. We have different timesteps for each latent frame in the diffusion forcing framework, so what to do here?
> We have different timesteps for each latent frame in the diffusion forcing framework, so what to do here?
I think you can copy-paste the function implementation in the SkyReels transformer file and modify as you see fit -- thanks for wrapping the logic in a layer!
I might be a bit confused: Isn't this what you suggested?
elif temb.dim() == 4:
    e = (self.scale_shift_table.unsqueeze(2) + temb.float()).chunk(6, dim=1)
    shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = [ei.squeeze(1) for ei in e]
A bit unsure in which case we end up having a 4D temb? Could you point me to the LoC?
if enable_diffusion_forcing:
    b, f = timestep.shape
    temb = temb.view(b, f, 1, 1, -1)
    timestep_proj = timestep_proj.view(b, f, 1, 1, 6, -1)  # dim is 6
    temb = temb.repeat(1, 1, post_patch_height, post_patch_width, 1).flatten(1, 3)
    timestep_proj = timestep_proj.repeat(1, 1, post_patch_height, post_patch_width, 1, 1).flatten(1, 3)  # dim is 4
    timestep_proj = timestep_proj.transpose(1, 2).contiguous()  # still 4
These are the lines right before `SkyReelsV2TransformerBlock.forward(hidden_states, encoder_hidden_states, timestep_proj, rotary_emb, causal_mask,)`.
possible to make a note on shape in the comments?
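One way to work out such a shape note is to trace the shapes by hand. The sketch below does this with plain tuples instead of tensors, so it runs without PyTorch. The helper functions, the concrete sizes, and the `(1, 6, dim)` shape assumed for `scale_shift_table` (taken from the Wan implementation this block was copied from) are illustrative assumptions, not code from the PR:

```python
from math import prod


def repeat(shape, reps):
    """Shape after tensor.repeat(*reps)."""
    assert len(shape) == len(reps)
    return tuple(s * r for s, r in zip(shape, reps))


def flatten(shape, start, end):
    """Shape after tensor.flatten(start, end), merging dims start..end inclusive."""
    return shape[:start] + (prod(shape[start:end + 1]),) + shape[end + 1:]


def transpose(shape, d0, d1):
    """Shape after tensor.transpose(d0, d1)."""
    dims = list(shape)
    dims[d0], dims[d1] = dims[d1], dims[d0]
    return tuple(dims)


def unsqueeze(shape, dim):
    """Shape after tensor.unsqueeze(dim)."""
    return shape[:dim] + (1,) + shape[dim:]


def broadcast(a, b):
    """Shape of a + b for two same-rank shapes under broadcasting."""
    assert len(a) == len(b)
    for x, y in zip(a, b):
        assert x == y or 1 in (x, y), f"cannot broadcast {a} with {b}"
    return tuple(max(x, y) for x, y in zip(a, b))


# Hypothetical sizes: batch, latent frames, patch grid height/width, hidden dim.
b, f, pph, ppw, dim = 2, 5, 3, 4, 8

# Transformer forward, diffusion forcing branch.
proj = (b, f, 1, 1, 6, dim)                  # .view(b, f, 1, 1, 6, -1)
proj = repeat(proj, (1, 1, pph, ppw, 1, 1))  # (b, f, pph, ppw, 6, dim)
proj = flatten(proj, 1, 3)                   # (b, f*pph*ppw, 6, dim), 4D
proj = transpose(proj, 1, 2)                 # (b, 6, f*pph*ppw, dim), still 4D

# Transformer block, `temb.dim() == 4` branch.
table = unsqueeze((1, 6, dim), 2)            # (1, 6, 1, dim)
summed = broadcast(table, proj)              # (b, 6, seq_len, dim)
chunk = summed[:1] + (1,) + summed[2:]       # each of the 6 chunks along dim 1
modulation = chunk[:1] + chunk[2:]           # .squeeze(1) -> (b, seq_len, dim)

print(proj, modulation)
```

Under these assumptions, each of the six modulation tensors has one row per token, which is what lets every latent frame carry its own timestep.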
…its suggested location.
Replaces `print()` calls with `logger.debug()` for reporting progress during long video generation in SkyReelsV2DF pipelines. This change reduces console output verbosity for standard runs while allowing developers to view progress by enabling debug-level logging.
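As a rough illustration of that change, the sketch below uses Python's standard `logging` module; the pipeline itself would obtain its logger through the `diffusers` logging utilities, and the logger name and function here are hypothetical:

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("skyreels_v2_demo")


def generate_long_video(num_iterations):
    """Stand-in for a long-video loop that reports progress per iteration."""
    for i in range(num_iterations):
        # Previously a print() call; now silent unless DEBUG logging is enabled.
        logger.debug("Generating iteration %d/%d", i + 1, num_iterations)
    return num_iterations
```

With the logger at its default level, a standard run prints nothing; calling `logger.setLevel(logging.DEBUG)` with a handler attached brings the progress messages back.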
Extract the sinusoidal timestep embedding logic into a new `SkyReelsV2Timesteps` `nn.Module`. This change encapsulates the embedding generation, which simplifies the `SkyReelsV2TimeTextImageEmbedding` class and improves code modularity.
Reshapes the timestep embedding tensor to match the original input shape. This ensures that batched timestep inputs retain their batch dimension after embedding, preventing potential shape mismatches.
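The flatten, embed, and restore pattern described in these two commits can be sketched in plain Python. This is only an illustration with nested lists and hypothetical function names, not the `SkyReelsV2Timesteps` module itself; it keeps the cos-then-sin layout of the original repo's `sinusoidal_embedding_1d`:

```python
import math


def sinusoidal_embedding_1d(dim, positions):
    """Embed a flat list of timesteps, cos half first, then sin half."""
    assert dim % 2 == 0
    half = dim // 2
    freqs = [10000.0 ** (-i / half) for i in range(half)]
    return [
        [math.cos(p * w) for w in freqs] + [math.sin(p * w) for w in freqs]
        for p in positions
    ]


def embed_per_frame(dim, timesteps):
    """timesteps is [batch][frames]; the result is [batch][frames][dim]."""
    num_frames = len(timesteps[0])
    flat = [t for row in timesteps for t in row]   # flatten to b * f values
    emb = sinusoidal_embedding_1d(dim, flat)       # b * f rows of width dim
    # Regroup the rows so the batch dimension survives the embedding step.
    return [emb[i:i + num_frames] for i in range(0, len(emb), num_frames)]


# Two samples, three latent frames each, with a different timestep per frame.
out = embed_per_frame(8, [[999.0, 500.0, 0.0], [10.0, 10.0, 10.0]])
print(len(out), len(out[0]), len(out[0][0]))
```

Flattening first sidesteps the 1D-only assertion discussed earlier in the review, and the final regrouping is what keeps batched, per-frame timesteps from collapsing into a single axis.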
cc @yiyixuxu could you take a final look + scheduler?
Thanks @tolgacangoz !
I left a few more comments, I'll ask SkyReels team for a review and think we can merge this soon:)
_supports_gradient_checkpointing = True
_skip_layerwise_casting_patterns = ["patch_embedding", "condition_embedder", "norm"]
_no_split_modules = ["SkyReelsV2TransformerBlock"]
_keep_in_fp32_modules = ["time_embedder", "scale_shift_table", "norm1", "norm2", "norm3"]
@a-r-r-o-w if we are going to rely more on `_keep_in_fp32_modules`, I think we should start looking at the docs so that they never recommend using `to(torch_dtype = ...)` across the library
Will open a PR tomorrow adding a note
# When using multi-GPU inference via accelerate these will be on the
# first device rather than the last device, which hidden_states ends up
# on.
shift = shift.to(hidden_states.device)
can we put this code in an ada layer norm and add that layer to `_no_split_modules`? We should not be handling devices for the multi-GPU use case
This is copied from the Wan implementation. We can update this in a follow up, or update both together in this PR. Relevant issue: #10997
ohh follow up is fine then
Btw, the current usage of the scheduler, removing Also, before merging, should these two messages be investigated?
@tolgacangoz I think it'll be good to decouple the FLF2V into a separate PR if the results are not good. I'm afraid I don't have the time to help in investigating the cause here right now, and this PR has been open for a really long time already and anticipated to be in master by many. Let's try to merge the ones that work for now :)
Colocates the `SkyReelsV2Timesteps` class with the SkyReelsV2 transformer model. This change moves model-specific timestep embedding logic from the general embeddings module to the transformer's own file, improving modularity and making the model more self-contained.
Replaces manual parameter iteration with the `get_parameter_dtype` helper to determine the time embedder's data type. This change improves code readability and centralizes the logic.
Or, I think they can stay as a placeholder or potential feature, because even with the original code I could not produce good results with 1.3B models for FLF2V. Or maybe it was I who couldn't run this task properly, idk :S. Maybe it is OK with larger models. I think this PR does its job well as an integration. Edit: I opened an issue at the original repo about this. I forgot to open it earlier, sry 🥲.
@tolgacangoz
Thanks for the opportunity to fix #11374!
Original Work
Original repo: https://github.com/SkyworkAI/SkyReels-V2
Paper: https://huggingface.co/papers/2504.13074
TODOs:
✅ `FlowMatchUniPCMultistepScheduler`: just copy-pasted from the original repo
✅ `SkyReelsV2Transformer3DModel`: 90% `WanTransformer3DModel`
✅ `SkyReelsV2DiffusionForcingPipeline`
✅ `SkyReelsV2DiffusionForcingImageToVideoPipeline`: Includes FLF2V.
✅ `SkyReelsV2DiffusionForcingVideoToVideoPipeline`: Extends a given video.
✅ `SkyReelsV2Pipeline`
✅ `SkyReelsV2ImageToVideoPipeline`: Includes FLF2V.
✅ `scripts/convert_skyreelsv2_to_diffusers.py`
tolgacangoz/SkyReels-V2-Diffusers
⏳ Did you make sure to update the documentation with your changes? Did you write any new necessary tests?: We will construct these during review.
T2V with Diffusion Forcing (OLD)
`diffusers` integration (attached videos): original_0_short.mp4, diffusers_0_short.mp4
`diffusers` integration (attached videos): original_37_short.mp4, diffusers_37_short.mp4
`diffusers` integration (attached videos): original_0_long.mp4, diffusers_0_long.mp4
`diffusers` integration (attached videos): original_37_long.mp4, diffusers_37_long.mp4
I2V with Diffusion Forcing (OLD)
`prompt="A penguin dances."`
`diffusers` integration (attached video): i2v-short.mp4
FLF2V with Diffusion Forcing (OLD)
Now, Houston, we have a problem.
I have been unable to produce good results with this task. I tried many hyperparameter combinations with the original code.
The first frame's latent (`torch.Size([1, 16, 1, 68, 120])`) is overwritten onto the first of the 25 frame latents of `latents` (`torch.Size([1, 16, 25, 68, 120])`). Then, the last frame's latent is concatenated, so `latents` becomes `torch.Size([1, 16, 26, 68, 120])`. After the denoising process, the last frame's latent is discarded and the rest is decoded by the VAE. I also tried not concatenating the last frame but overwriting it onto the last frame of `latents`, without discarding the last frame latent at the end, but still got bad results. Here are some results:
0.mp4
1.mp4
2.mp4
3.mp4
4.mp4
5.mp4
6.mp4
7.mp4
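To make the frame bookkeeping described above easier to follow, here is a small plain-Python sketch that tracks only which latent occupies which frame slot. The string labels and function names are hypothetical stand-ins for tensors; this is not the pipeline code:

```python
def flf2v_concat_layout(num_latent_frames=25):
    """Original flow: overwrite slot 0, append the last frame, drop it after denoising."""
    latents = [f"noise_{i}" for i in range(num_latent_frames)]
    latents[0] = "first_frame"              # overwrite the first frame latent
    latents = latents + ["last_frame"]      # concatenate: 25 -> 26 frame latents
    # ... denoising would run here on all 26 frame latents ...
    to_decode = latents[:-1]                # discard the appended latent before the VAE
    return latents, to_decode


def flf2v_overwrite_layout(num_latent_frames=25):
    """Variant that was also tried: overwrite the last slot and keep every frame."""
    latents = [f"noise_{i}" for i in range(num_latent_frames)]
    latents[0] = "first_frame"
    latents[-1] = "last_frame"
    return latents, latents


denoised, decoded = flf2v_concat_layout()
print(len(denoised), len(decoded))
```

In the first layout the last-frame latent only conditions the denoising and never reaches the decoder, while in the variant it occupies the final decoded slot.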
V2V with Diffusion Forcing (OLD)
This pipeline extends a given video.
`diffusers` integration (attached videos): video1.mp4, v2v.mp4
Firstly, I want to congratulate you on this great work, and thanks for open-sourcing it, SkyReels Team! This PR proposes an integration of your model.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@yiyixuxu @a-r-r-o-w @linoytsaban @yjp999 @Howe2018 @RoseRollZhu @pftq @Langdx @guibinchen @qiudi0127 @nitinmukesh @tin2tin @ukaprch @okaris