Qwen Image pipeline, Various #12295

@dxqb

Describe the bug

First of all, thank you for providing this pipeline! It makes it much easier to build tools on top of these models.
While doing that, I ran across various issues, big and small. I'll list them here - let me know if any of them should be split into separate issues.

Here:
https://github.com/huggingface/diffusers/blob/fc337d585309c4b032e8d0180bea683007219df1/src/diffusers/pipelines/qwenimage/pipeline_qwenimage.py#L212C9-L212C74
The system prompt is removed from the hidden states.

The encoded hidden state of:

"<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n"

is turned into the encoded hidden state of:
{}<|im_end|>\n<|im_start|>assistant\n

I cannot find anything in the Qwen technical report saying that the hidden states of the system prompt were removed during training, but I assume someone involved with Qwen provided this information.
What is surprising, then, is that the remaining template tokens after the user prompt are not removed: {}<|im_end|>\n<|im_start|>assistant\n stays in the encoder hidden states passed to the transformer.
Can you confirm that this is correct, and matches how the model was trained?
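
For reference, a minimal standalone sketch of what that slicing does (the token counts and hidden size below are illustrative, not the real values):

```python
import torch

# Minimal sketch (numbers illustrative): the encoder output covers the full
# chat template. Suppose the prefix up to and including "<|im_start|>user\n"
# takes `drop_idx` tokens, the user prompt `n_user` tokens, and the trailing
# "<|im_end|>\n<|im_start|>assistant\n" part `n_tail` tokens.
drop_idx, n_user, n_tail, hidden_dim = 34, 12, 5, 3584

hidden_states = torch.randn(1, drop_idx + n_user + n_tail, hidden_dim)

# What the pipeline does: drop the first `drop_idx` token embeddings.
prompt_embeds = hidden_states[:, drop_idx:, :]

# The user prompt *and* the trailing template tokens both remain.
assert prompt_embeds.shape[1] == n_user + n_tail
```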

Unless I'm missing something, the splitting of the hidden states per batch sample here

split_hidden_states = self._extract_masked_hidden(hidden_states, txt_tokens.attention_mask)
and then stacking them back into batched hidden states doesn't do anything. The end result is the same as the input.
The only thing it seems to do is zero out the masked tokens. This should be irrelevant, because they are masked. But even if it isn't, the same thing can be achieved by

prompt_embeds = hidden_states[:, drop_idx:, :] * txt_tokens.attention_mask[:, drop_idx:].unsqueeze(-1)

What was the intention of this code?
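
To illustrate the point, here is a rough, standalone equivalence check. It mirrors the split/re-stack logic as I read it and assumes right padding, which is what the tokenizer call produces; the sizes are illustrative:

```python
import torch

torch.manual_seed(0)
bsz, seq_len, dim, drop_idx = 2, 20, 8, 4
hidden_states = torch.randn(bsz, seq_len, dim)
attention_mask = torch.ones(bsz, seq_len, dtype=torch.long)
attention_mask[1, 15:] = 0  # second sample is shorter, padded on the right

# Pipeline-style path: split each sample by its mask, drop the template
# prefix, then re-pad to the longest sample and stack.
split = [h[m.bool()][drop_idx:] for h, m in zip(hidden_states, attention_mask)]
max_len = max(e.size(0) for e in split)
stacked = torch.stack(
    [torch.cat([e, e.new_zeros(max_len - e.size(0), dim)]) for e in split]
)

# Proposed one-liner: slice off the prefix and zero the padded positions.
direct = hidden_states[:, drop_idx:, :] * attention_mask[:, drop_idx:].unsqueeze(-1)

print(torch.allclose(stacked, direct[:, :max_len]))  # True under these assumptions
```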

This seems unnecessary:

prompt_embeds_mask = prompt_embeds_mask[:, :max_sequence_length]

The prompt is already truncated by the tokenizer here:
txt, max_length=self.tokenizer_max_length + drop_idx, padding=True, truncation=True, return_tensors="pt"

If the intention is to allow truncating it further, why is the first truncation hardcoded to 1024 here, only to truncate a second time afterwards?

self.tokenizer_max_length = 1024
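
Spelled out with the numbers involved (tokenizer_max_length is the hardcoded value above; drop_idx and max_sequence_length below are illustrative):

```python
tokenizer_max_length = 1024   # hardcoded in __init__
drop_idx = 34                 # length of the template prefix (illustrative)
max_sequence_length = 1024    # whatever the caller passes to the pipeline

# The tokenizer already truncates to tokenizer_max_length + drop_idx, and the
# prefix is dropped afterwards, so the embeddings have at most
# tokenizer_max_length tokens before the second slice is applied.
upper_bound_after_encode = tokenizer_max_length

# prompt_embeds[:, :max_sequence_length] can therefore only change anything
# when max_sequence_length < tokenizer_max_length.
second_slice_is_noop = max_sequence_length >= upper_bound_after_encode
print(second_slice_is_noop)  # True in this example
```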

Documentation of these parameters is missing:

img_shapes: Optional[List[Tuple[int, int, int]]] = None,

The type hint of img_shapes is a list of tuples.
But actually, a list of lists of tuples is passed here:

img_shapes = [[(1, height // self.vae_scale_factor // 2, width // self.vae_scale_factor // 2)]] * batch_size

This is then converted from a list of lists of tuples back into a list of tuples here:

But what's the intention here? Why is it a list at all? Do you plan to support different shapes in the same batch? And why a list of lists? See the sketch below.
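
To make the nesting concrete, here is what the default construction actually produces (values and vae_scale_factor below are illustrative):

```python
# The signature declares img_shapes: Optional[List[Tuple[int, int, int]]],
# but the default that gets constructed is nested one level deeper.
height, width, vae_scale_factor, batch_size = 1024, 1024, 8, 2  # illustrative

img_shapes = [
    [(1, height // vae_scale_factor // 2, width // vae_scale_factor // 2)]
] * batch_size
print(img_shapes)  # [[(1, 64, 64)], [(1, 64, 64)]] -- a list of lists of tuples
```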

These defaults are incorrect for Qwen:

self.scheduler.config.get("max_image_seq_len", 4096),

Compare here: https://huggingface.co/Qwen/Qwen-Image/blob/main/scheduler/scheduler_config.json
I assume the defaults are not actually used because the real scheduler config takes precedence, but still.

The scheduler config here is contradictory:
https://huggingface.co/docs/diffusers/main/api/pipelines/qwenimage#lora-for-faster-inference

base_shift and max_shift are set to the same value. That means there is effectively no dynamic timestep shifting, yet dynamic timestep shifting is enabled.
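
To spell out why equal shifts disable the dynamic part, a small standalone sketch mirroring the mu calculation as I understand it (the helper below is my own re-implementation; the config values are illustrative, with both shifts set to roughly log(3) as in the linked config):

```python
import math

def calculate_shift(image_seq_len, base_seq_len, max_seq_len, base_shift, max_shift):
    # Linear interpolation between base_shift and max_shift over the sequence length.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# With base_shift == max_shift the slope m is 0, so mu is the same constant
# regardless of the image sequence length, i.e. no dynamic shifting at all.
for seq_len in (256, 1024, 8192):
    print(calculate_shift(seq_len, base_seq_len=256, max_seq_len=8192,
                          base_shift=math.log(3), max_shift=math.log(3)))
```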


The pipeline requires an AutoencoderKLQwenImage, not an AutoencoderKL as the vae type hint suggests.

The shape here is wrong:

shape = (batch_size, 1, num_channels_latents, height, width)

At other points in the code, it's [B, C, F, H, W], not [B, F, C, H, W]. It doesn't have any effect currently, with F always 1, because the latents are packed right after.
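
To illustrate why the swapped axes are currently harmless, a small standalone sketch (the packing below is my own re-implementation of what I understand the packing step to do; values illustrative):

```python
import torch

batch_size, num_channels_latents, height, width = 1, 16, 64, 64  # illustrative

latents_bfchw = torch.randn(batch_size, 1, num_channels_latents, height, width)  # current code
latents_bcfhw = latents_bfchw.permute(0, 2, 1, 3, 4)  # [B, C, F, H, W] as used elsewhere

def pack(latents):
    # Re-implementation of the 2x2 patch packing, for illustration only.
    latents = latents.reshape(batch_size, num_channels_latents, height // 2, 2, width // 2, 2)
    latents = latents.permute(0, 2, 4, 1, 3, 5)
    return latents.reshape(batch_size, (height // 2) * (width // 2), num_channels_latents * 4)

# With a single latent frame both orderings hold the same numbers, so the
# packed result is identical either way.
print(torch.equal(pack(latents_bfchw), pack(latents_bcfhw)))  # True
```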

Reproduction

If any of the points above need reproduction code, let me know and I'll provide it.

Logs

System Info

diffusers HEAD

Who can help?

@yiyixuxu @DN6 @sayakpaul
CC @JingyaHuang @naykun because of this possibly related issue #12294
