
[wip][poc] make group offloading work with disk/nvme transfers #11682


Draft · wants to merge 10 commits into main

Conversation

@sayakpaul sayakpaul (Member) commented Jun 9, 2025

What does this PR do?

Group offloading is a crucial feature for achieving a good speed-memory trade-off with large models on consumer hardware. However, because group offloading relies heavily on RAM, it can be bottlenecked by RAM availability. As a result, on machines where GPU VRAM exceeds the available RAM, or on machines with limited RAM, group offloading can be far from ideal.

This PR takes a stab at supporting disk/NVMe serialization/deserialization inside group offloading so that users can rely on secondary storage to onload/offload model params while still benefiting from the overlap between compute and data transfer.
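For a quick look at the user-facing API this PoC experiments with, here is a minimal sketch; the offload_to_disk and offload_path arguments are the tentative additions from this PR, and the full benchmark script is further below:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Group-offload the transformer; offloaded groups are serialized to disk/NVMe
# instead of being kept resident in RAM.
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,
    use_stream=True,
    offload_to_disk=True,    # new in this PR (PoC)
    offload_path="offload",  # directory that will hold the serialized groups
)
```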

Below are some numbers I have gathered with this PR:

| Mode | Time (s) | RAM (GB) | GPU (GB) |
| --- | --- | --- | --- |
| base | 6.594 | 1.838 | 33.85 |
| model CPU offload | 21.682 | 35.504 | 22.64 |
| sequential CPU offload | 68.406 | 33.196 | 2.41 |
| group offload | 47.693 | 36.421 | 11.68 |
| group offload with disk / NVMe support | 55.467 | 2.814 | 11.68 |
| same + compile | 55.296 | 3.036 | 11.68 |
Code
from diffusers import DiffusionPipeline
import torch.utils.benchmark as benchmark
import torch
import psutil
import os
import json
import argparse

def benchmark_fn(f, *args, **kwargs):
    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.3f}"

def run_inference(pipe, pipe_kwargs):
    _ = pipe(**pipe_kwargs)

def initialize_pipeline():
    pipe = DiffusionPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
    )
    pipe.set_progress_bar_config(disable=True)
    return pipe

def maybe_apply_offloading(pipe, args):
    if not args.model_cpu_offload and not args.seq_cpu_offload and not args.group_offload:
        pipe = pipe.to("cuda")
    else:
        if args.model_cpu_offload:
            pipe.enable_model_cpu_offload()
        elif args.seq_cpu_offload:
            pipe.enable_sequential_cpu_offload()
        elif args.group_offload:
            pipe.transformer.enable_group_offload(
                onload_device=torch.device("cuda"), 
                offload_device=torch.device("cpu"), 
                offload_type="block_level",
                num_blocks_per_group=1,
                use_stream=True,
                non_blocking=False,
                offload_to_disk=True if args.offload_to_disk else False,
                offload_path="." if args.offload_to_disk else None,
                record_stream=True
            )
            
            # For the rest of the components, just place on CUDA.
            for name, component in pipe.components.items():
                if name != "transformer" and isinstance(component, torch.nn.Module):
                    component.cuda()

    return pipe


def main(args):
    process = psutil.Process(os.getpid())
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.empty_cache()

    pipe = initialize_pipeline()
    pipe = maybe_apply_offloading(pipe, args)
    pipe_kwargs = {
        "prompt": "A cat holding a sign that says hello world",
        "height": 1024,
        "width": 1024,
        "guidance_scale": 3.5,
        "num_inference_steps": 28,
        "max_sequence_length": 512,
        "generator": torch.manual_seed(0),
    }
    time = benchmark_fn(run_inference, pipe, pipe_kwargs)
    inference_memory = torch.cuda.max_memory_allocated() / (1024 ** 3)
    inference_memory = float(f"{inference_memory:.2f}")
    ram_bytes = process.memory_info().rss
    ram_gb = ram_bytes / (1024 ** 3)

    # report
    print(f"Peak GPU memory: {inference_memory} GB")
    print(f"Resident CPU memory (RSS): {ram_gb:.2f} GB")

    prefix = "base"
    for key, value in vars(args).items():
        prefix += f"_{key}@{value}"
    
    image = pipe(**pipe_kwargs).images[0]
    image.save(f"{prefix}.png")
    
    artifact_dict = {"time": time, "memory": inference_memory, "ram": ram_gb}
    artifact_dict.update(vars(args))
    with open(f"{prefix}.json", "w") as f:
        json.dump(artifact_dict, f)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_cpu_offload", action="store_true")
    parser.add_argument("--seq_cpu_offload", action="store_true")
    parser.add_argument("--group_offload", action="store_true")
    parser.add_argument("--offload_to_disk", action="store_true")
    args = parser.parse_args()

    main(args)
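To reproduce a given row, the script can be invoked with the corresponding flags, e.g. `python benchmark_offload.py --group_offload --offload_to_disk` (the file name is illustrative).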

Quality comparison:
[preview image]

The stark background color difference with regular group offloading exists on the main branch as well, so I am not sure what is happening there.

Group offloading with disk serialization/deserialization works with torch.compile(), too.

This PR is a PoC and hence has some rough edges that could be improved. I'd be fine if the PR is dropped entirely or if someone else wants to take it over and see it through to completion. Otherwise, I am completely fine continuing to work on it.

@asomoza I think you will be quite interested in this one.

@sayakpaul sayakpaul requested review from DN6 and a-r-r-o-w June 9, 2025 06:59
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w (Member) left a comment


Really cool work getting this started!

  • Can we see some results with a more compute-heavy model like Wan?
  • We probably need to look at some profiles to see if there is any overlap happening when streams are used with disk offload (reason: I think there's a blocking operation that prevents this, but I'm not 100% sure).
  • re: the stark background color difference; weird, I'll take a look.
  • Can we also benchmark the disk usage?

Edit: For the benchmark, I think a fair comparison across all methods would require using group offloading on all components instead of just the transformer. Maybe the benchmark could be updated to show the memory usage with (1) just the transformer and (2) all components.
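For reference, a hedged sketch of what configuration (2) could look like, extending maybe_apply_offloading from the script above. Only diffusers ModelMixin components expose enable_group_offload; other modules are simply placed on CUDA here, so this is an approximation rather than the PR's actual benchmark:

```python
import torch
from diffusers import ModelMixin

def apply_group_offload_to_all_components(pipe, offload_to_disk=False, offload_path="."):
    # Configuration (2): group-offload every diffusers model in the pipeline,
    # not just the transformer.
    for name, component in pipe.components.items():
        if isinstance(component, ModelMixin):
            component.enable_group_offload(
                onload_device=torch.device("cuda"),
                offload_device=torch.device("cpu"),
                offload_type="block_level",
                num_blocks_per_group=1,
                use_stream=True,
                offload_to_disk=offload_to_disk,                        # flag added by this PR (PoC)
                offload_path=offload_path if offload_to_disk else None,
            )
        elif isinstance(component, torch.nn.Module):
            # Non-diffusers modules (e.g. text encoders) are simply placed on CUDA here.
            component.cuda()
    return pipe
```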

all_tensors.extend(list(module.buffers()))
all_tensors.extend(self.parameters)
all_tensors.extend(self.buffers)
all_tensors = list(dict.fromkeys(all_tensors)) # Remove duplicates
Member

(nit): will there be duplicates? I cannot think of a quick example, so maybe we can remove this.

Member Author

There shouldn't really be any. But I kept it to guard against edge cases, having read about something similar elsewhere.

self._is_offloaded_to_disk = False

if self.offload_to_disk:
    if self.offload_path is None:
Member

(nit): we could probably package this into a separate helper function, like what's done with _init_cpu_param_dict

Member Author

Since it's a one-liner now, I think it's okay here.

Comment on lines +160 to +166
# Load to CPU, pin, and async copy to device for overlapping transfer and compute
loaded_cpu_tensors = safetensors.torch.load_file(self.safetensors_file_path, device="cpu")
for key, tensor_obj in self.key_to_tensor.items():
    pinned_tensor = loaded_cpu_tensors[key].pin_memory()
    tensor_obj.data = pinned_tensor.to(self.onload_device, non_blocking=self.non_blocking)
    if self.record_stream:
        tensor_obj.data.record_stream(current_stream)
Member

I think a cleaner approach would be to provide a callable to map_location (assuming we were using torch.load instead of safetensors), which could pin each tensor and move it to the device. Do we know if there is an equivalent to passing a callable with safetensors? If not, this is okay too.

Member Author

Do we know of other alternatives to this code path? If not, I think it's better as is. From skimming the safetensors documentation, I couldn't find any equivalent of map_location.
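For what it's worth, one possible alternative (not what this PR does) would be lazy per-tensor loading via safetensors' safe_open, which avoids materializing the whole group as a dict before pinning. A rough sketch, reusing the key_to_tensor mapping from the snippet above:

```python
import torch
from safetensors import safe_open

def onload_from_disk_lazily(path, key_to_tensor, onload_device, non_blocking=True):
    # Stream tensors one at a time from the safetensors file instead of
    # loading the whole group into a dict first.
    with safe_open(path, framework="pt", device="cpu") as f:
        for key, tensor_obj in key_to_tensor.items():
            pinned = f.get_tensor(key).pin_memory()
            tensor_obj.data = pinned.to(onload_device, non_blocking=non_blocking)
```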

@@ -169,6 +219,18 @@ def onload_(self):
    @torch.compiler.disable()
    def offload_(self):
        r"""Offloads the group of modules to the offload_device."""
        if self.offload_to_disk:
Member

(nit): we probably need to refactor this a bit and break it into smaller methods so we don't have to branch and do early returns every time a new feature is added (we can do the refactor once we have everything working, so it's not urgent)

Member Author

Indeed. Can I do it in an immediate follow-up PR so that it's easier to review?

self._is_offloaded_to_disk = True

for tensor_obj in self.tensor_to_key.keys():
    tensor_obj.data = torch.empty_like(tensor_obj.data, device=self.offload_device)
Member

What's the reason for this to be different from the non-disk-offload counterpart? That is, is there a reason we're not doing buffer.data.to(self.offload_device, non_blocking=self.non_blocking)?

Member Author

So, we first free up the memory of the accelerator with:

key: tensor.data.to(self.offload_device) for tensor, key in self.tensor_to_key.items()

However, since we're also optimizing for RAM usage (which can be made clearer through documentation, I believe), we need to free the RAM that is holding the tensor data. Once the data has been safely written from RAM to disk, this step replaces the large data tensors in RAM with uninitialized placeholders so that the memory can be released.
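To make the flow concrete, here is a minimal sketch (not the PR's exact hook code) of the disk-offload path described above; tensor_to_key and the file path mirror the attributes quoted in the diff:

```python
import safetensors.torch
import torch

def offload_group_to_disk(tensor_to_key, safetensors_file_path, offload_device=torch.device("cpu")):
    # 1) Free the accelerator memory by gathering CPU copies keyed by their serialization names.
    cpu_tensors = {key: tensor.data.to(offload_device) for tensor, key in tensor_to_key.items()}
    # 2) Persist the whole group to disk.
    safetensors.torch.save_file(cpu_tensors, safetensors_file_path)
    # 3) Swap the live tensors for uninitialized placeholders so the large CPU buffers
    #    can be reclaimed; the actual data now lives only in the safetensors file.
    for tensor in tensor_to_key:
        tensor.data = torch.empty_like(tensor.data, device=offload_device)
```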

@sayakpaul
Member Author

@a-r-r-o-w thanks for your comments. I will work on them.

> We probably need to look at some profiles to see if there is any overlap happening when streams are used with disk offload

I can gather this. Should we gather the CPU and GPU activities through the profiler and export a trace? If you have any references for me to consider, feel free to send over.

@a-r-r-o-w
Member

a-r-r-o-w commented Jun 9, 2025

> I can gather this. Should we gather the CPU and GPU activities through the profiler and export a trace? If you have any references for me to consider, feel free to send over.

I think CPU/GPU activities will measure all the operations in the model, so if you could collect just the filtered stream-related operations and the associated onloading/offloading times, that would be helpful!

You probably already know this, but for readers, the following will help with gathering traces for visualization:
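The comment originally pointed to a reference that is not reproduced here. As a hedged sketch, a trace like the following can be exported with torch.profiler and inspected in chrome://tracing or Perfetto to check whether the H2D copies issued on the side stream overlap with compute on the default stream:

```python
import torch
from torch.profiler import ProfilerActivity, profile

def trace_inference(pipe, pipe_kwargs, trace_path="group_offload_trace.json"):
    # Record both CPU and CUDA activity during one pipeline call.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        _ = pipe(**pipe_kwargs)
    # Export a Chrome trace; the stream assignment of each kernel/copy is visible in the viewer.
    prof.export_chrome_trace(trace_path)
```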

@SunMarc SunMarc self-requested a review June 9, 2025 12:51
@sayakpaul
Member Author

@a-r-r-o-w for Wan:

| | Time (s) | GPU memory (GB) | RAM (GB) | model_cpu_offload | seq_cpu_offload | group_offload | offload_to_disk |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 237.827 | 15.15 | 2.80788 | False | False | True | True |
| 1 | 151.684 | 41.89 | 1.96027 | False | False | False | False |
| 2 | 226.929 | 15.15 | 42.4634 | False | False | True | False |
| 3 | 172.826 | 28.7 | 39.4348 | True | False | False | False |

@a-r-r-o-w
Member

a-r-r-o-w commented Jun 9, 2025

Awesome, thanks for sharing! The numbers look good.

re: the weird color results with group offloading: I looked into it, and it seems to only happen with block_level offloading (I don't know why yet). I think it could be due to some incorrect/missing synchronization somewhere, so I will try to fix it. If you use leaf_level, it should produce the same result.
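For readers who hit the block_level artifact, a hedged sketch of the leaf_level configuration mentioned above, assuming pipe is the FLUX pipeline from the script earlier (note that leaf_level does not take num_blocks_per_group):

```python
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",  # offload at leaf-module granularity instead of block_level
    use_stream=True,
)
```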

@sayakpaul sayakpaul added the performance label (Anything related to performance improvements, profiling and benchmarking) on Jun 10, 2025
@sayakpaul
Member Author

@a-r-r-o-w could I ask for another review at this point? I have also added a simple test to make sure it's working, but I can also add a heavier integration test that checks VRAM and RAM usage. Apart from that, I think only the docs are missing.

@sayakpaul sayakpaul requested a review from a-r-r-o-w June 12, 2025 05:41
Labels
performance: Anything related to performance improvements, profiling and benchmarking
3 participants