[docs] Memory optims #11385
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
AutoencoderKLWan and AsymmetricAutoencoderKL do not support tiling or slicing (the asymmetric one just has unused flags); this most likely should be mentioned.
Thanks for the initiative! I left some minor comments, let me know if they make sense.
Model offloading moves entire models to the GPU instead of selectively moving *some* layers or model components. One of the main pipeline models, usually the text encoder, UNet, or VAE, is placed on the GPU while the other components are held on the CPU. A component like the UNet that runs multiple times stays on the GPU until it is completely finished and no longer needed. This eliminates the communication overhead of [CPU offloading](#cpu-offloading) and makes model offloading a faster alternative. The tradeoff is that the memory savings won't be as large.
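For reference, a minimal usage sketch (the checkpoint name and prompt are just examples):

```python
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
# Moves one whole model (text encoder, UNet, or VAE) to the GPU at a time;
# the other components wait on the CPU until they are needed.
pipeline.enable_model_cpu_offload()

image = pipeline("a photo of an astronaut riding a horse").images[0]
```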
> [!WARNING]
Do we want to add a warning after `enable_sequential_cpu_offload()` that it's terribly slow and can often appear to be impractical?
I have several sentences about this in the ## CPU offloading section and I bolded that it is extremely slow. I think that should do the trick :)
> [!WARNING]
> Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs because it may clash with group offloading's device casting mechanism.
Providing some example models would be helpful here. Cc: @a-r-r-o-w
Not sure I recall any official model implementation in transformers/diffusers off the top of my head. Basically, if you cast inputs by peeking into the device of a particular weight layer in a model, it might fail. I'll try to find/remember an example
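As a hedged illustration (a made-up module, not from any official model), the problematic pattern looks something like this:

```python
import torch
import torch.nn as nn

class WeightDependentCast(nn.Module):
    """Hypothetical module whose forward casts inputs by peeking at a weight's device."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(16, 16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Casting the input to the device of a specific weight can clash with
        # group offloading, which manages device placement itself: the weight
        # may have already been moved off the GPU when this line runs.
        x = x.to(self.proj.weight.device)
        return self.proj(x)
```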
Let me know when you remember them and we can update in a later PR!
[`~StableDiffusionPipeline.enable_sequential_cpu_offload`] is a stateful operation that installs hooks on the models.
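For context, a minimal sketch of how it is enabled (the checkpoint is just an example):

```python
import torch
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# Installs hooks that move each submodule to the GPU right before it runs
# and back to the CPU afterwards. Extremely memory-frugal but very slow.
pipeline.enable_sequential_cpu_offload()
```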
VAE tiling saves memory by dividing an image into smaller overlapping tiles instead of processing the entire image at once. This also reduces peak memory usage because the GPU only processes one tile at a time. Unlike sliced VAE, tiled VAE maintains some context between tiles because they overlap, which can generate more coherent images.
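A hedged usage sketch (pipeline and checkpoint are examples):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# Decode the latents tile by tile; overlapping tiles are blended at the seams.
pipeline.enable_vae_tiling()
```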
Sliced VAE simply breaks a larger batch of data into batch_size=1 slices and processes them sequentially. The comparison here about sliced VAE vs tiled VAE maintaining context between tiles seems incorrect.
Removed comparison!
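For contrast, a sketch of sliced VAE (reusing the `pipeline` from the tiling sketch above; prompts are illustrative):

```python
# Sliced VAE: the batch is split into batch_size=1 slices that are decoded
# sequentially, trading throughput for lower peak memory.
pipeline.enable_vae_slicing()

# Generating several images at once now decodes them one at a time.
images = pipeline(["a cat", "a dog", "a fox"]).images
```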
Thanks for the feedback on the memory doc! I also updated the inference speed doc, so please feel free to check it out and leave some feedback ❤️
Refactors the memory optimization docs and combines it with working with big models (distributed setups).
Let me know if I'm missing anything!