
Commit be05e58 ("ltx")
1 parent 91901c7

4 files changed: +134 −199 lines changed

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@
[CogVideoX](https://huggingface.co/papers/2408.06072) is a large diffusion transformer model - available in 2B and 5B parameters - designed to generate longer and more consistent videos from text. This model uses a 3D causal variational autoencoder to more efficiently process video data by reducing sequence length (and associated training compute) and preventing flickering in generated videos. An "expert" transformer with adaptive LayerNorm improves alignment between text and video, and 3D full attention helps accurately capture motion and time in generated videos.

-You can find all the original CogVideoX checkpoints under the CogVideoX [collection](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce).
+You can find all the original CogVideoX checkpoints under the [CogVideoX](https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce) collection.

> [!TIP]
> Click on the CogVideoX models in the right sidebar for more examples of how to use CogVideoX for other video generation tasks.
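Not part of this commit, but for orientation: a minimal sketch of loading one of the linked CogVideoX checkpoints with Diffusers. The model id `THUDM/CogVideoX-5b`, the prompt, and the generation parameters are illustrative assumptions, not text from this file.

```py
# Minimal text-to-video sketch with CogVideoX (illustrative, not from this commit).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# "THUDM/CogVideoX-5b" is the 5B checkpoint from the collection linked above.
pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

prompt = "A panda playing a guitar in a bamboo forest."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)  # CogVideoX generates at 8 fps
```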

docs/source/en/api/pipelines/hunyuan_video.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,7 @@
[HunyuanVideo](https://huggingface.co/papers/2412.03603) is a 13B diffusion transformer model designed to be competitive with closed-source video foundation models and enable wider community access. This model uses a "dual-stream to single-stream" architecture to separately process the video and text tokens first, before concatenating and feeding them to the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) is used as the encoder because it has better image-text alignment, better image detail description and reasoning, and it can be used as a zero-shot learner if system instructions are added to user prompts. Finally, HunyuanVideo uses a 3D causal variational autoencoder to more efficiently process video data at the original resolution and frame rate.

-You can find all the original HunyuanVideo checkpoints under the Tencent [organization](https://huggingface.co/tencent).
+You can find all the original HunyuanVideo checkpoints under the [Tencent](https://huggingface.co/tencent) organization.

> [!TIP]
> The examples below use a checkpoint from [hunyuanvideo-community](https://huggingface.co/hunyuanvideo-community) because the weights are stored in a layout compatible with Diffusers.
@@ -64,6 +64,8 @@ export_to_video(video, "output.mp4", fps=15)
</hfoptions>
<hfoption id="inference speed">

+Compilation is slow the first time but subsequent calls to the pipeline are faster.
+
```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
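The diff view truncates the snippet here. For context, a self-contained sketch of the kind of quantized, compiled pipeline these imports open; the 4-bit config, tiling/offloading calls, prompt, and generation parameters are assumptions inferred from the imports, not this file's actual contents.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers.utils import export_to_video

# Quantize the transformer weights to 4-bit to fit consumer GPUs (assumed config).
quant_config = DiffusersBitsAndBytesConfig(load_in_4bit=True)
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer,
    torch_dtype=torch.float16,
)

# Reduce peak memory: offload idle components to CPU and decode the video in tiles.
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# Compile the transformer: the first call pays the compilation cost,
# subsequent calls reuse the compiled graph and run faster.
pipeline.transformer = torch.compile(pipeline.transformer)

prompt = "A fluffy bear walks through a snowy forest at dawn."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)
```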
