Describe the bug
First of all, thank you for providing this pipeline! It makes it much easier to build tools on top of these models.
While doing that, I ran across various issues, big and small. I'll list them here; let me know if any of them should be filed as separate issues.
Here:
https://github.com/huggingface/diffusers/blob/fc337d585309c4b032e8d0180bea683007219df1/src/diffusers/pipelines/qwenimage/pipeline_qwenimage.py#L212C9-L212C74
The system prompt is removed from the hidden states.
The encoded hidden state of

```
<|im_start|>system\nDescribe the image by detailing the color, shape, size, texture, quantity, text, spatial relationships of the objects and background:<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n
```

is turned into the encoded hidden state of

```
{}<|im_end|>\n<|im_start|>assistant\n
```
I cannot find in the Qwen technical report that the hidden_states of the system prompt were removed during training, but I assume that someone involved with Qwen has provided this information.
But then, it's surprising that the rest of the template after the user prompt is not removed as well: `<|im_end|>\n<|im_start|>assistant\n` remains in the encoder hidden states passed to the transformer.
Can you confirm that this is correct, and as the model was trained?
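For reference, a minimal tensor-level sketch of what the slicing does (shapes and drop_idx are illustrative values, not the pipeline's actual ones):

```python
import torch

# Encoder output for the full chat template: system prefix + user prompt + assistant suffix.
batch, seq_len, dim = 1, 20, 8
hidden_states = torch.randn(batch, seq_len, dim)

# drop_idx = number of tokens in "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n"
# (illustrative value; the pipeline uses a hardcoded template start index)
drop_idx = 6

prompt_embeds = hidden_states[:, drop_idx:]  # system prefix removed
# The hidden states for the trailing "<|im_end|>\n<|im_start|>assistant\n"
# tokens after the user prompt are still included in prompt_embeds.
```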
Unless I'm missing something, the splitting of the hidden states per batch sample here

```python
split_hidden_states = self._extract_masked_hidden(hidden_states, txt_tokens.attention_mask)
```

only seems to zero out the masked tokens. That should be irrelevant, because they are masked anyway. And even if it weren't, the same thing could be achieved with

```python
prompt_embeds = hidden_states[:, drop_idx:, :] * attention_mask[:, drop_idx:].unsqueeze(-1)
```

What was the intention of this code?
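A quick sanity check on dummy tensors (assuming right padding, which I believe the tokenizer produces here) that per-sample extraction plus re-stacking gives the same result as simply zeroing the padded positions:

```python
import torch

hidden_states = torch.randn(2, 7, 4)
attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0],
                               [1, 1, 1, 0, 0, 0, 0]])
drop_idx = 2

# pipeline-style: split out the unmasked tokens per sample, drop the prefix,
# then right-pad back to a common length
split = [h[m.bool()][drop_idx:] for h, m in zip(hidden_states, attention_mask)]
max_len = max(s.size(0) for s in split)
stacked = torch.stack(
    [torch.cat([s, s.new_zeros(max_len - s.size(0), s.size(1))]) for s in split]
)

# proposed one-liner: slice and zero the masked positions
masked = hidden_states[:, drop_idx:] * attention_mask[:, drop_idx:].unsqueeze(-1)

print(torch.allclose(stacked, masked[:, :max_len]))  # True under these assumptions
```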
This seems unnecessary:

```python
prompt_embeds_mask = prompt_embeds_mask[:, :max_sequence_length]
```

The prompt is already truncated by the tokenizer here:

```python
txt, max_length=self.tokenizer_max_length + drop_idx, padding=True, truncation=True, return_tensors="pt"
```

If the intention is to allow truncating further, why hardcode the first truncation to 1024 here and then truncate a second time?

```python
self.tokenizer_max_length = 1024
```
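A sketch of the two truncation steps, with illustrative lengths (the drop_idx and max_sequence_length values here are assumptions, not the pipeline's):

```python
tokenizer_max_length = 1024  # hardcoded in the pipeline
drop_idx = 34                # illustrative template-prefix length
max_sequence_length = 512    # user-supplied

# 1) the tokenizer call truncates to tokenizer_max_length + drop_idx tokens
seq_len = tokenizer_max_length + drop_idx
# 2) the template prefix is dropped, leaving tokenizer_max_length tokens
seq_len -= drop_idx
# 3) the mask/embeds are truncated again; this only matters when
#    max_sequence_length < tokenizer_max_length
seq_len = min(seq_len, max_sequence_length)
print(seq_len)  # 512
```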
Documentation of these parameters is missing:

```python
img_shapes: Optional[List[Tuple[int, int, int]]] = None,
```

The type hint of img_shapes is a list of tuples. But what is actually passed here is a list of lists of tuples:

```python
img_shapes = [[(1, height // self.vae_scale_factor // 2, width // self.vae_scale_factor // 2)]] * batch_size
```

which is then converted from a list of lists of tuples back into a list of tuples here:

```python
if isinstance(video_fhw, list):
```

What's the intention here? Why is it a list at all - do you plan to support different shapes within the same batch? And why a list of lists?
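To make the mismatch concrete, a small sketch (the values are illustrative):

```python
from typing import List, Tuple

# What the signature promises:
hinted: List[Tuple[int, int, int]] = [(1, 64, 64)]

# What the pipeline actually builds: one inner list per batch sample,
# i.e. a list of lists of (frames, height, width) tuples.
batch_size = 2
passed = [[(1, 64, 64)]] * batch_size
print(passed)  # [[(1, 64, 64)], [(1, 64, 64)]]
```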
These defaults are incorrect for Qwen:

```python
self.scheduler.config.get("max_image_seq_len", 4096),
```

Compare with the actual config: https://huggingface.co/Qwen/Qwen-Image/blob/main/scheduler/scheduler_config.json

I guess the defaults are never used because the actual scheduler config takes precedence, but they are still misleading.
The scheduler config here is contradictory:
https://huggingface.co/docs/diffusers/main/api/pipelines/qwenimage#lora-for-faster-inference
base_shift and max_shift are set to the same value. That means there is effectively no dynamic timestep shifting, yet dynamic timestep shifting is enabled.
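To illustrate why equal endpoints disable the dynamic behavior: the shift mu is linearly interpolated between base_shift and max_shift. This sketch mirrors the calculate_shift helper used by the flow-match pipelines; the numbers are illustrative, not the shipped config values:

```python
def calculate_shift(image_seq_len: int, base_seq_len: int, max_seq_len: int,
                    base_shift: float, max_shift: float) -> float:
    # mu grows linearly from base_shift to max_shift as the image
    # sequence length grows from base_seq_len to max_seq_len.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# With base_shift == max_shift, the slope m is 0 and mu is constant:
print(calculate_shift(1024, 256, 8192, 0.9, 0.9))  # 0.9
print(calculate_shift(8192, 256, 8192, 0.9, 0.9))  # 0.9, independent of seq len
```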
```python
vae ([`AutoencoderKL`]):
```

This requires an AutoencoderKLQwenImage, not an AutoencoderKL.
The shape here is wrong:

```python
shape = (batch_size, 1, num_channels_latents, height, width)
```

Elsewhere in the code it's [B, C, F, H, W], not [B, F, C, H, W]. It currently has no effect, since F is always 1 and the shape is packed right afterwards, but it is inconsistent.
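Side by side, with variable names from the pipeline (a sketch):

```python
batch_size, num_channels_latents, height, width = 1, 16, 64, 64

shape_as_written = (batch_size, 1, num_channels_latents, height, width)  # [B, F, C, H, W]
shape_elsewhere = (batch_size, num_channels_latents, 1, height, width)   # [B, C, F, H, W] convention
```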
Reproduction
If any of the points above need reproduction code, let me know and I'll provide it.
Logs
System Info
diffusers HEAD
Who can help?
@yiyixuxu @DN6 @sayakpaul
CC @JingyaHuang @naykun because of this possibly related issue #12294