Describe the bug
First of all, thank you for providing this pipeline! It makes it much easier to build tools on top of these models.
While doing that, I ran across various issues, big and small. I'll list them here; let me know if any of them should be filed as separate issues.
Here:
https://github.com/huggingface/diffusers/blob/fc337d585309c4b032e8d0180bea683007219df1/src/diffusers/pipelines/qwenimage/pipeline_qwenimage.py#L212C9-L212C74
The system prompt is removed from the hidden states.
The encoded hidden state of

```
<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n
```

is turned into the encoded hidden state of

```
{}<|im_end|>\n<|im_start|>assistant\n
```
I cannot find in the Qwen technical report that the hidden_states of the system prompt were removed during training, but I assume that someone involved with Qwen has provided this information.
But then, it's surprising that the rest of the template after the user prompt is not removed as well: `<|im_end|>\n<|im_start|>assistant\n` remains in the encoder hidden states passed to the transformer.
Can you confirm that this is correct, and as the model was trained?
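For reference, a minimal tensor-level sketch of what the slicing does (shapes and drop_idx are illustrative values, not the pipeline's actual ones):

```python
import torch

# Encoder output for the full chat template: system prefix + user prompt + assistant suffix.
batch, seq_len, dim = 1, 20, 8
hidden_states = torch.randn(batch, seq_len, dim)

# drop_idx = number of tokens in "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n"
# (illustrative value; the pipeline uses a hardcoded template start index)
drop_idx = 6

prompt_embeds = hidden_states[:, drop_idx:]  # system prefix removed
# The hidden states for the trailing "<|im_end|>\n<|im_start|>assistant\n"
# tokens after the user prompt are still included in prompt_embeds.
```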
Unless I'm missing something, the splitting of the hidden states per batch sample here

```python
split_hidden_states = self._extract_masked_hidden(hidden_states, txt_tokens.attention_mask)
```

only seems to zero out the masked tokens. That should be irrelevant, because they are masked anyway. And even if it weren't, the same thing could be achieved with

```python
prompt_embeds = hidden_states[:, drop_idx:, :] * attention_mask[:, drop_idx:].unsqueeze(-1)
```

What was the intention of this code?
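A quick sanity check on dummy tensors (assuming right padding, which I believe the tokenizer produces here) that per-sample extraction plus re-stacking gives the same result as simply zeroing the padded positions:

```python
import torch

hidden_states = torch.randn(2, 7, 4)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 0, 0, 0, 0]])
drop_idx = 2

# pipeline-style: split out the unmasked tokens per sample, drop the prefix,
# then right-pad back to a common length
split = [h[m.bool()][drop_idx:] for h, m in zip(hidden_states, attention_mask)]
max_len = max(s.size(0) for s in split)
stacked = torch.stack(
    [torch.cat([s, s.new_zeros(max_len - s.size(0), s.size(1))]) for s in split]
)

# proposed one-liner: slice and zero the masked positions
masked = hidden_states[:, drop_idx:] * attention_mask[:, drop_idx:].unsqueeze(-1)

print(torch.allclose(stacked, masked[:, :max_len]))  # True under these assumptions
```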
This seems unnecessary:

```python
prompt_embeds_mask = prompt_embeds_mask[:, :max_sequence_length]
```

The prompt is already truncated by the tokenizer here:

```python
txt, max_length=self.tokenizer_max_length + drop_idx, padding=True, truncation=True, return_tensors="pt"
```

If the intention is to allow truncating further, why hardcode the first truncation to 1024 here and then truncate a second time?

```python
self.tokenizer_max_length = 1024
```
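A sketch of the two truncation steps, with illustrative lengths (the drop_idx and max_sequence_length values here are assumptions, not the pipeline's):

```python
tokenizer_max_length = 1024  # hardcoded in the pipeline
drop_idx = 34                # illustrative template-prefix length
max_sequence_length = 512    # user-supplied

# 1) the tokenizer call truncates to tokenizer_max_length + drop_idx tokens
seq_len = tokenizer_max_length + drop_idx
# 2) the template prefix is dropped, leaving tokenizer_max_length tokens
seq_len -= drop_idx
# 3) the mask/embeds are truncated again; this only matters when
#    max_sequence_length < tokenizer_max_length
seq_len = min(seq_len, max_sequence_length)
print(seq_len)  # 512
```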
Documentation of these parameters is missing:

```python
img_shapes: Optional[List[Tuple[int, int, int]]] = None,
```

The type hint of img_shapes is a list of tuples. But what is actually passed here is a list of lists of tuples:

```python
img_shapes = [[(1, height // self.vae_scale_factor // 2, width // self.vae_scale_factor // 2)]] * batch_size
```

which is then converted from a list of lists of tuples back into a list of tuples here:

```python
if isinstance(video_fhw, list):
```

What's the intention here? Why is it a list at all - do you plan to support different shapes within the same batch? And why a list of lists?
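To make the mismatch concrete, a small sketch (the values are illustrative):

```python
from typing import List, Tuple

# What the signature promises:
hinted: List[Tuple[int, int, int]] = [(1, 64, 64)]

# What the pipeline actually builds: one inner list per batch sample,
# i.e. a list of lists of (frames, height, width) tuples.
batch_size = 2
passed = [[(1, 64, 64)]] * batch_size
print(passed)  # [[(1, 64, 64)], [(1, 64, 64)]]
```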
These defaults are incorrect for Qwen:

```python
self.scheduler.config.get("max_image_seq_len", 4096),
```

Compare with the actual config: https://huggingface.co/Qwen/Qwen-Image/blob/main/scheduler/scheduler_config.json

I guess the defaults are never used because the actual scheduler config takes precedence, but they are still misleading.
The scheduler config here is contradictory:
https://huggingface.co/docs/diffusers/main/api/pipelines/qwenimage#lora-for-faster-inference
base_shift and max_shift are set to the same value. That means there is effectively no dynamic timestep shifting, yet dynamic timestep shifting is enabled.
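To illustrate why equal endpoints disable the dynamic behavior: the shift mu is linearly interpolated between base_shift and max_shift. This sketch mirrors the calculate_shift helper used by the flow-match pipelines; the numbers are illustrative, not the shipped config values:

```python
def calculate_shift(image_seq_len: int, base_seq_len: int, max_seq_len: int,
                    base_shift: float, max_shift: float) -> float:
    # mu grows linearly from base_shift to max_shift as the image
    # sequence length grows from base_seq_len to max_seq_len.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# With base_shift == max_shift, the slope m is 0 and mu is constant:
print(calculate_shift(1024, 256, 8192, 0.9, 0.9))  # 0.9
print(calculate_shift(8192, 256, 8192, 0.9, 0.9))  # 0.9, independent of seq len
```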
```python
vae ([`AutoencoderKL`]):
```

This requires an AutoencoderKLQwenImage, not an AutoencoderKL.
The shape here is wrong:

```python
shape = (batch_size, 1, num_channels_latents, height, width)
```

Elsewhere in the code it's [B, C, F, H, W], not [B, F, C, H, W]. It currently has no effect, since F is always 1 and the shape is packed right afterwards, but it is inconsistent.
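Side by side, with variable names from the pipeline (a sketch):

```python
batch_size, num_channels_latents, height, width = 1, 16, 64, 64

shape_as_written = (batch_size, 1, num_channels_latents, height, width)  # [B, F, C, H, W]
shape_elsewhere = (batch_size, num_channels_latents, 1, height, width)   # [B, C, F, H, W] convention
```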
Reproduction
If any of the points above need reproduction code, let me know and I'll provide it.
Logs
System Info
diffusers HEAD
Who can help?
@yiyixuxu @DN6 @sayakpaul
CC @JingyaHuang @naykun because of this possibly related issue #12294