[docs] Memory optims #11385

Open · wants to merge 6 commits into main

Conversation

stevhliu
Member

@stevhliu stevhliu commented Apr 22, 2025

Refactors the memory optimization docs and combines them with the docs on working with big models (distributed setups).

Let me know if I'm missing anything!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@stevhliu stevhliu requested a review from sayakpaul April 22, 2025 23:16
@stevhliu stevhliu mentioned this pull request Apr 23, 2025
@stevhliu stevhliu marked this pull request as ready for review April 23, 2025 21:13
@Heasterian

AutoencoderKLWan and AsymmetricAutoencoderKL do not support tiling or slicing (the asymmetric one just has unused flags); this should most likely be mentioned.

Member

@sayakpaul sayakpaul left a comment

Thanks for the initiative! I left some minor comments, let me know if they make sense.


Model offloading moves entire models to the GPU instead of selectively moving *some* layers or model components. One of the main pipeline models, usually the text encoder, UNet, or VAE, is placed on the GPU while the other components are held on the CPU. Components like the UNet, which run multiple times, stay on the GPU until they are completely finished and no longer needed. This eliminates the communication overhead of [CPU offloading](#cpu-offloading) and makes model offloading a faster alternative. The tradeoff is that the memory savings won't be as large.
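
For reference, a minimal sketch of enabling model offloading on a pipeline (the checkpoint name and prompt are illustrative, not part of this PR's diff):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load in half precision; the checkpoint name here is only an example.
pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Moves one whole model (text encoder, UNet, or VAE) onto the GPU at a time
# instead of offloading individual layers.
pipe.enable_model_cpu_offload()

image = pipe("a photo of an astronaut riding a horse").images[0]
```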

> [!WARNING]
Member

Do we want to add a warning after enable_sequential_cpu_offload() that it's terribly slow and can often appear to be impractical?

Member Author

I have several sentences about this in the ## CPU offloading section and I bolded that it is extremely slow. I think that should do the trick :)
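
For context, a hedged sketch of how sequential CPU offloading is enabled (checkpoint name and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Stateful operation: installs hooks on the models. Weights are streamed to the
# GPU submodule by submodule, which saves the most memory but is extremely slow.
pipe.enable_sequential_cpu_offload()

image = pipe("a castle in the clouds").images[0]
```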

Comment on lines +214 to +215
> [!WARNING]
> Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading's device casting mechanism.
Member

Providing some example models would be helpful here. Cc: @a-r-r-o-w

Member

Not sure I recall any official model implementation in transformers/diffusers off the top of my head. Basically, if you cast inputs by peeking into the device of a particular weight layer in a model, it might fail. I'll try to find/remember an example.

Member Author

Let me know when you remember them and we can update in a later PR!
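
To illustrate the pattern discussed above, here is a hypothetical forward implementation (not from transformers/diffusers) that casts inputs by peeking at a weight's device, the kind of code that can clash with group offloading's device casting mechanism:

```python
import torch
from torch import nn


class WeightCastingBlock(nn.Module):
    """Hypothetical block whose forward does weight-dependent device casting."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Peeking at self.proj.weight.device can conflict with group offloading,
        # since the weight may still sit on the offload (CPU) device when this runs.
        hidden_states = hidden_states.to(self.proj.weight.device)
        return self.proj(hidden_states)
```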


[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.
VAE tiling saves memory by dividing an image into smaller overlapping tiles instead of processing the entire image at once. This also reduces peak memory usage because the GPU is only processing a tile at a time. Unlike sliced VAE, tiled VAE maintains some context between tiles because they overlap which can generate more coherent images.
Member

Sliced VAE simply breaks a larger batch of data into batch_size=1 chunks and processes them sequentially. The comparison here of sliced VAE vs. tiled VAE maintaining context between tiles seems incorrect.

Member Author

Removed comparison!
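
For context on the distinction discussed in this thread, a hedged sketch of how slicing and tiling are typically enabled on a pipeline (checkpoint, prompt, and resolution are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_slicing()  # decode a batch one image at a time (batch_size=1)
pipe.enable_vae_tiling()   # decode each image in smaller overlapping tiles

images = pipe(["a lighthouse at dusk"] * 4, height=1024, width=1024).images
```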

@stevhliu
Member Author

Thanks for the feedback on the memory doc! I also updated the inference speed doc, so please feel free to check it out and leave some feedback ❤️
