
Conversation

Ratish1 commented Nov 5, 2025

What does this PR do?

This PR fixes a bug that causes the QwenImage model to crash when using context parallelism with a prompt whose sequence length is not divisible by the world size.

The fix is implemented within the QwenImageTransformer2DModel and consists of three parts:

  1. Input Padding: The text prompt embeddings (encoder_hidden_states) and their attention mask are padded at the start of the forward method so that their length is divisible by the world size.
  2. RoPE Correction: The model now uses the new, padded sequence length when generating the rotary positional embeddings (RoPE), preventing a tensor shape mismatch that was causing a RuntimeError.
  3. Attention Masking: QwenDoubleStreamAttnProcessor2_0 is corrected to build and use a proper additive attention mask, so the padded tokens are ignored by the attention mechanism and the numerical output of the model is preserved (see the sketch below).

A new unit test is also added to simulate a distributed environment. It verifies that the padding logic prevents the crash while ensuring the output is numerically equivalent to the baseline, non-padded run.
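To make the idea concrete, here is a minimal sketch of the padding and additive-mask logic. The helper names (pad_for_context_parallel, to_additive_mask) and the tensor sizes are hypothetical and chosen only for illustration; this is not the actual implementation in the PR.

```python
import torch
import torch.nn.functional as F


def pad_for_context_parallel(encoder_hidden_states, encoder_attention_mask, world_size):
    # Pad the text sequence so its length is divisible by world_size.
    # (Hypothetical helper for illustration; not the actual diffusers code.)
    seq_len = encoder_hidden_states.shape[1]
    pad_len = (-seq_len) % world_size
    if pad_len > 0:
        # Zero-pad along the sequence dimension (dim 1).
        encoder_hidden_states = F.pad(encoder_hidden_states, (0, 0, 0, pad_len))
        # Mark the padded positions as masked out (0 = ignore).
        encoder_attention_mask = F.pad(encoder_attention_mask, (0, pad_len), value=0)
    return encoder_hidden_states, encoder_attention_mask


def to_additive_mask(attention_mask, dtype):
    # Convert a 0/1 key-padding mask into an additive attention bias of shape
    # (batch, 1, 1, seq_len) that broadcasts over heads and query positions.
    return (1.0 - attention_mask[:, None, None, :].to(dtype)) * torch.finfo(dtype).min


# Example: a 77-token prompt padded for world_size=4 becomes 80 tokens
# (batch size and hidden dim are arbitrary here).
hidden = torch.randn(1, 77, 64)
mask = torch.ones(1, 77, dtype=torch.long)
hidden, mask = pad_for_context_parallel(hidden, mask, world_size=4)
bias = to_additive_mask(mask, hidden.dtype)
print(hidden.shape, mask.shape, bias.shape)
# torch.Size([1, 80, 64]) torch.Size([1, 80]) torch.Size([1, 1, 1, 80])
```

Using a large negative bias for the padded key positions drives their attention weights to effectively zero after the softmax, which is why the padded run can stay numerically equivalent to the unpadded baseline.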

Fixes #12568


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sayakpaul @yiyixuxu @DN6

Ratish1 changed the title from "fix(qwenimage): Correct context parallelism padding" to "fix(qwenimage): Add padding for context parallelism" on Nov 5, 2025
yiyixuxu (Collaborator) commented Nov 5, 2025

Thanks for the PR! However, we don't want any of this logic to go into the Qwen transformer. Would you be interested in looking into how to support this case (not just Qwen) from the context parallel hooks?
https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/context_parallel.py#L204

Ratish1 (Author) commented Nov 5, 2025


Hi @yiyixuxu, yes, I would be interested in supporting this change. Should I revert my changes to the Qwen transformer and implement the padding logic inside the _prepare_cp_input function in context_parallel.py, as you suggested?

I have one follow-up question about the model-specific consequences of this padding. The Qwen transformer calculates rotary positional embeddings (RoPE) based on the original sequence length, so when the hook pads the input tensor, the model still needs to know the new, padded length to avoid a shape mismatch inside the RoPE calculation.

Previously, I fixed this by recalculating the sequence length inside the transformer's forward method based on the padded tensor's shape. With the padding logic now in the hook, what is the preferred way to handle this?

Is it acceptable to keep that small, model-specific piece of logic (recalculating the sequence length for RoPE) inside the Qwen transformer, or is there a more general way to communicate the new padded length from the hook back to the model that I should use instead? Thanks for your help.
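If it helps the discussion, here is a rough sketch of what a generic, model-agnostic version could look like at the hook level. The pad_and_shard helper is purely illustrative (it is not the actual _prepare_cp_input implementation); it only shows that the padded length can be recovered from the padded tensor's shape, which is what the RoPE calculation would need.

```python
import torch
import torch.nn.functional as F


def pad_and_shard(tensor, dim, world_size, rank):
    # Illustrative only (not the actual hook): pad `tensor` along `dim` so it
    # splits evenly across `world_size` ranks, then return this rank's shard
    # together with the padded length.
    length = tensor.shape[dim]
    pad_len = (-length) % world_size
    if pad_len > 0:
        # F.pad specifies padding from the last dimension backwards, so build
        # the pad spec for the requested dim explicitly.
        pad = [0, 0] * (tensor.ndim - 1 - dim) + [0, pad_len]
        tensor = F.pad(tensor, pad)
    shard = tensor.chunk(world_size, dim=dim)[rank]
    return shard, length + pad_len


# The model (or its RoPE helper) could then read the sequence length from the
# padded tensor's shape instead of the original prompt length.
x = torch.randn(1, 77, 64)
shard, padded_len = pad_and_shard(x, dim=1, world_size=4, rank=0)
print(shard.shape, padded_len)  # torch.Size([1, 20, 64]) 80
```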

