Replies: 2 comments 2 replies
-
--cache-none in combination with --disable-smart-memory roughly does what I want. I don't know yet whether that has any unwanted side effects. However, I would strongly suggest that the Comfy team takes another look at the default settings.

The argument is: every modern OS (I tested Windows 11, WSL on Windows 11, and CachyOS Linux) already caches files loaded from disk in RAM, keeps them there as long as possible, and evicts them in favour of later loads once the buffer space is exhausted. That is smart memory management at the operating-system level already. If Comfy does the same thing on top of it, it uses twice the resources and inevitably causes swapping on low-RAM systems, and the offloading from VRAM to RAM burns unnecessary additional CPU/GPU resources on top of that. This does not make sense, and the defaults should be changed accordingly.

On WSL the effect is even more wasteful, since the Windows host, the Linux guest, and Comfy all cache the files. Even my 256 GB RAM main rig quickly exhausts its resources under those circumstances.
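To make the page-cache point concrete, here is a quick sketch you can run yourself (the path is just a placeholder for any large checkpoint you already have on disk). The second read is served from RAM by the OS and is typically much faster; that cached copy is exactly what gets duplicated when Comfy additionally keeps its own copy of the weights in RAM:

```python
import time

MODEL_PATH = "/path/to/a/large/checkpoint.safetensors"  # placeholder, point this at any big file

def timed_read(path: str) -> float:
    """Read the whole file in chunks and return the elapsed time in seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(64 * 1024 * 1024):  # 64 MiB chunks, data is discarded
            pass
    return time.perf_counter() - start

cold = timed_read(MODEL_PATH)  # first read: comes (mostly) from disk
warm = timed_read(MODEL_PATH)  # second read: served from the OS page cache in RAM
print(f"cold: {cold:.2f}s  warm: {warm:.2f}s")
```

On Linux you can watch the buff/cache column of `free -h` grow while this runs.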
-
Hi there,
Is there a way to make Comfy treat VRAM the way, e.g., Ollama does?
I would like to disable offloading to RAM but keep partial loading for large models. Flux.dev, for example, doesn't quite fit in my 24 GB VRAM, but it's still faster than or almost as fast as Q8 GGUF or fp8_*, and the output quality is a lot better, especially when rendering images with text.
Currently Comfy does the following: Load Text Encoders to VRAM -> Use Text Encoders -> Offload Text Encoders to RAM -> Partially Load Flux (about 95%, hardly any speed reduction) -> Render -> Offload Flux to RAM, and repeat.
At the same time my OS (Linux) caches the models in RAM as well (in the OS's buff/cache memory), so the models end up cached in RAM twice.
What I'd like to see is: Load Text Encoders to VRAM -> Use Text Encoders -> Discard Text Encoders from VRAM -> Partially Load Flux -> Render -> Discard Flux from VRAM / Shared VRAM -> Repeat
That would not only be faster but also more memory efficient.
When I use the --highvram or --gpu-only switch, I run out of memory on the device.
When I use the --disable-smart-memory switch, it unloads models immediately after using them, but it still offloads them to RAM first.
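To illustrate the distinction in plain PyTorch terms (this is not ComfyUI's actual model-management code, just a sketch that assumes a CUDA-capable GPU): offloading keeps a copy of the weights in system RAM, while discarding frees the VRAM and relies on the OS page cache, or a fresh load from disk, the next time the model is needed.

```python
import gc
import torch

# Stand-in for a big diffusion / text-encoder model; a real one would be several GB.
model = torch.nn.Linear(8192, 8192).cuda()

# "Offload" (what smart memory does): the weights move back into system RAM.
model.to("cpu")

# "Discard" (what I'm asking for): free the VRAM without keeping a RAM copy.
model.cuda()                          # back on the GPU just for this example
del model
gc.collect()
torch.cuda.empty_cache()              # hand the freed memory back to the driver
print(torch.cuda.memory_allocated())  # ~0 bytes left allocated on the device
```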
Thanks in advance,
Peter