Improve Wan docstrings #11689

Merged 4 commits · Jun 11, 2025

11 changes: 7 additions & 4 deletions src/diffusers/pipelines/wan/pipeline_wan.py
@@ -388,8 +388,10 @@ def __call__(

Args:
prompt (`str` or `List[str]`, *optional*):
- The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
- instead.
+ The prompt or prompts to guide the image generation. If not defined, pass `prompt_embeds` instead.
+ negative_prompt (`str` or `List[str]`, *optional*):
+ The prompt or prompts to avoid during image generation. If not defined, pass `negative_prompt_embeds`
+ instead. Ignored when not using guidance (`guidance_scale` < `1`).
height (`int`, defaults to `480`):
The height in pixels of the generated image.
width (`int`, defaults to `832`):
@@ -434,8 +436,9 @@ def __call__(
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
- autocast_dtype (`torch.dtype`, *optional*, defaults to `torch.bfloat16`):
- The dtype to use for the torch.amp.autocast.
+ max_sequence_length (`int`, defaults to `512`):
+ The maximum sequence length of the text encoder. If the prompt is longer than this, it will be
+ truncated. If the prompt is shorter, it will be padded to this length.

Examples:

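For reviewers following along, here is a minimal sketch of how the parameters touched by this diff (`prompt`, `negative_prompt`, `max_sequence_length`) surface in a `WanPipeline` call. The checkpoint id and prompt text are illustrative, not part of this PR:

```python
import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Illustrative checkpoint id; any Wan 2.1 T2V Diffusers checkpoint should work.
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

frames = pipe(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="blurry, low quality",  # ignored when guidance_scale < 1
    height=480,
    width=832,
    max_sequence_length=512,  # longer prompts are truncated, shorter ones padded
).frames[0]
export_to_video(frames, "t2v_output.mp4", fps=16)
```
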
10 changes: 4 additions & 6 deletions src/diffusers/pipelines/wan/pipeline_wan_i2v.py
@@ -562,12 +562,10 @@ def __call__(
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
- max_sequence_length (`int`, *optional*, defaults to `512`):
- The maximum sequence length of the prompt.
- shift (`float`, *optional*, defaults to `5.0`):
- The shift of the flow.
- autocast_dtype (`torch.dtype`, *optional*, defaults to `torch.bfloat16`):
- The dtype to use for the torch.amp.autocast.
+ max_sequence_length (`int`, defaults to `512`):
+ The maximum sequence length of the text encoder. If the prompt is longer than this, it will be
+ truncated. If the prompt is shorter, it will be padded to this length.

Examples:

Returns:
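The same `max_sequence_length` semantics apply to the image-to-video pipeline. A hedged sketch, where the checkpoint id and input image URL are placeholders:

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Illustrative checkpoint id for the I2V variant.
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("https://example.com/first_frame.png")  # placeholder URL
frames = pipe(
    image=image,
    prompt="An astronaut slowly waving at the camera.",
    height=480,
    width=832,
    max_sequence_length=512,  # text-encoder length, per the new docstring
).frames[0]
export_to_video(frames, "i2v_output.mp4", fps=16)
```
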
32 changes: 29 additions & 3 deletions src/diffusers/pipelines/wan/pipeline_wan_vace.py
@@ -687,8 +687,33 @@ def __call__(

Args:
prompt (`str` or `List[str]`, *optional*):
- The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
+ The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
+ instead.
+ negative_prompt (`str` or `List[str]`, *optional*):
+ The prompt or prompts not to guide the image generation. If not defined, one has to pass
+ `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
+ less than `1`).
+ video (`List[PIL.Image.Image]`, *optional*):
+ The input video or videos to be used as a starting point for the generation. The video should be a list
+ of PIL images, a numpy array, or a torch tensor. Currently, the pipeline only supports generating one
+ video at a time.
+ mask (`List[PIL.Image.Image]`, *optional*):
+ The input mask defines which video regions to condition on and which to generate. Black areas in the
+ mask indicate conditioning regions, while white areas indicate regions for generation. The mask should
+ be a list of PIL images, a numpy array, or a torch tensor. Currently supports generating a single video
+ at a time.
+ reference_images (`List[PIL.Image.Image]`, *optional*):
+ A list of one or more reference images as extra conditioning for the generation. For example, if you
+ are trying to inpaint a video to change the character, you can pass reference images of the new
+ character here. Refer to the Diffusers [examples](https://github.com/huggingface/diffusers/pull/11582)
+ and original [user
+ guide](https://github.com/ali-vilab/VACE/blob/0897c6d055d7d9ea9e191dce763006664d9780f8/UserGuide.md)
+ for a full list of supported tasks and use cases.
+ conditioning_scale (`float`, `List[float]`, `torch.Tensor`, defaults to `1.0`):
+ The conditioning scale to be applied when adding the control conditioning latent stream to the
+ denoising latent stream in each control layer of the model. If a float is provided, it will be applied
+ uniformly to all layers. If a list or tensor is provided, it should have the same length as the number
+ of control layers in the model (`len(transformer.config.vace_layers)`).
height (`int`, defaults to `480`):
The height in pixels of the generated image.
width (`int`, defaults to `832`):
@@ -733,8 +758,9 @@ def __call__(
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
- autocast_dtype (`torch.dtype`, *optional*, defaults to `torch.bfloat16`):
- The dtype to use for the torch.amp.autocast.
+ max_sequence_length (`int`, defaults to `512`):
+ The maximum sequence length of the text encoder. If the prompt is longer than this, it will be
+ truncated. If the prompt is shorter, it will be padded to this length.

Examples:

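To make the new `video`/`mask`/`conditioning_scale` docstrings concrete, a sketch of a VACE call. Everything here (checkpoint id, frame count, all-white mask meaning "generate everywhere") is an assumption for illustration, not mandated by this PR:

```python
import torch
import PIL.Image
from diffusers import WanVACEPipeline
from diffusers.utils import export_to_video

# Illustrative checkpoint id; see PR #11582 for the VACE pipeline itself.
pipe = WanVACEPipeline.from_pretrained(
    "Wan-AI/Wan2.1-VACE-1.3B-diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

num_frames, height, width = 81, 480, 832
# Per the mask docstring: black = condition on this region, white = generate it.
video = [PIL.Image.new("RGB", (width, height), (0, 0, 0)) for _ in range(num_frames)]
mask = [PIL.Image.new("RGB", (width, height), (255, 255, 255)) for _ in range(num_frames)]

frames = pipe(
    prompt="A corgi surfing a wave at sunset.",
    video=video,
    mask=mask,
    conditioning_scale=1.0,  # float -> applied uniformly to every control layer
    num_frames=num_frames,
    height=height,
    width=width,
).frames[0]
export_to_video(frames, "vace_output.mp4", fps=16)
```

A list or 1-D tensor with `len(transformer.config.vace_layers)` entries would scale each control layer individually instead of uniformly.
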
9 changes: 6 additions & 3 deletions src/diffusers/pipelines/wan/pipeline_wan_video2video.py
@@ -508,7 +508,7 @@ def __call__(

Args:
prompt (`str` or `List[str]`, *optional*):
- The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
+ The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
instead.
height (`int`, defaults to `480`):
The height in pixels of the generated image.
@@ -525,6 +525,8 @@ def __call__(
of [Imagen Paper](https://huggingface.co/papers/2205.11487). Guidance scale is enabled by setting
`guidance_scale > 1`. Higher guidance scale encourages to generate images that are closely linked to
the text `prompt`, usually at the expense of lower image quality.
+ strength (`float`, defaults to `0.8`):
+ Higher strength leads to more differences between original image and generated video.
num_videos_per_prompt (`int`, *optional*, defaults to 1):
The number of images to generate per prompt.
generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
@@ -554,8 +556,9 @@ def __call__(
The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
`._callback_tensor_inputs` attribute of your pipeline class.
- autocast_dtype (`torch.dtype`, *optional*, defaults to `torch.bfloat16`):
- The dtype to use for the torch.amp.autocast.
+ max_sequence_length (`int`, defaults to `512`):
+ The maximum sequence length of the text encoder. If the prompt is longer than this, it will be
+ truncated. If the prompt is shorter, it will be padded to this length.

Examples:

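Finally, a sketch tying together the newly documented `strength` and the `callback_on_step_end_tensor_inputs` machinery described above. The checkpoint id and input clip are placeholders:

```python
import torch
from diffusers import WanVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video

# Illustrative checkpoint id; V2V is assumed to reuse the T2V weights here.
pipe = WanVideoToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

input_frames = load_video("input.mp4")  # placeholder path; a list of PIL frames

def on_step_end(pipeline, step, timestep, callback_kwargs):
    # Only names listed in pipeline._callback_tensor_inputs may be requested.
    latents = callback_kwargs["latents"]
    print(f"step {step}: latents shape {tuple(latents.shape)}")
    return callback_kwargs

frames = pipe(
    video=input_frames,
    prompt="Same scene, claymation style.",
    strength=0.8,  # higher -> diverges more from the input video
    callback_on_step_end=on_step_end,
    callback_on_step_end_tensor_inputs=["latents"],
).frames[0]
export_to_video(frames, "v2v_output.mp4", fps=16)
```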