
Commit be05e58 ("ltx")
1 parent 91901c7

4 files changed: +134 −199 lines changed

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.

-You can find all the original CogVideoX checkpoints under the CogVideoX [collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
+You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection.

> [!TIP]
> Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
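Not part of this commit, but for orientation: a minimal sketch of loading one of the linked CogVideoX checkpoints with Diffusers. The model id `THUDM/CogVideoX-5b`, the prompt, and the generation parameters are illustrative assumptions, not text from this file.

```py
# Minimal text-to-video sketch with CogVideoX (illustrative, not from this commit).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# "THUDM/CogVideoX-5b" is the 5B checkpoint from the collection linked above.
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

prompt = "A panda playing a guitar in a bamboo forest."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)  # CogVideoX generates at 8 fps
```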

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,7 @@
[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.

-You can find all the original HunyuanVideo checkpoints under the Tencent [organization](https://huggingface.co/tencent).
+You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.
@@ -64,6 +64,8 @@ export_to_video(video, "output.mp4", fps=15)
</hfoptions>
<hfoption id="inference speed">

+Compilation is slow the first time but subsequent calls to the pipeline are faster.
+
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
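The diff view truncates the snippet here. For context, a self-contained sketch of the kind of quantized, compiled pipeline these imports open; the 4-bit config, tiling/offloading calls, prompt, and generation parameters are assumptions inferred from the imports, not this file's actual contents.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers.utils import export_to_video

# Quantize the transformer weights to 4-bit to fit consumer GPUs (assumed config).
quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    torch_dtype=torch.float16,
)

# Reduce peak memory: offload idle components to CPU and decode the video in tiles.
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# Compile the transformer: the first call pays the compilation cost,
# subsequent calls reuse the compiled graph and run faster.
pipeline.transformer = torch.compile(pipeline.transformer)

prompt = "A fluffy bear walks through a snowy forest at dawn."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```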
