Muon optimizer shows no GPU memory reduction with Accelerator mixed precision (bf16) #48

@gitouni

Description

Hi, thanks for your great work on this project!

I encountered an issue when using Muon with mixed precision training via Hugging Face Accelerator.

Problem:
When I wrap SingleDeviceMuonWithAuxAdam with Accelerator using mixed_precision="bf16", GPU memory usage does not decrease compared to full-precision (float32) training. In contrast, with the standard Adam optimizer, GPU consumption drops as expected under bf16 mixed precision.

Here’s a simplified version of my setup:

import os

from accelerate import Accelerator
from muon import SingleDeviceMuonWithAuxAdam

accelerator = (
    Accelerator(
        log_with="tensorboard",
        mixed_precision='bf16',
        gradient_accumulation_steps=2,
        project_dir=os.environ.get('TRAIN_TF_EVENTS_PATH'),
    )
    if args.use_accelerate
    else None
)

# Hidden parameters use Muon; the remaining parameters use the auxiliary Adam.
hidden_params, nonhidden_params = model.get_hidden_nonhidden_params()
param_groups = [
    dict(params=hidden_params, use_muon=True, lr=args.lr, weight_decay=0.01),
    dict(params=nonhidden_params, use_muon=False, lr=args.lr, betas=(0.9, 0.98), weight_decay=0.01),
]
optimizer = SingleDeviceMuonWithAuxAdam(param_groups)

model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
    model, optimizer, scheduler, train_loader, valid_loader
)

Observations:

With float32, Muon shows lower GPU consumption than Adam, as expected (2613 MB vs. 2803 MB).

With bf16, Muon's GPU usage is higher than Adam's, and no memory saving is observed (2695 MB vs. 2679 MB).
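
To narrow this down, here is a minimal diagnostic sketch (the helper name summarize_memory is mine, not from the repo). It assumes SingleDeviceMuonWithAuxAdam keeps its per-parameter buffers in the standard torch.optim.Optimizer optimizer.state mapping, and it sums the bytes held by parameters, gradients, and optimizer state grouped by dtype, so the float32 and bf16 runs can be compared directly after a few training steps.

import torch

def summarize_memory(model, optimizer):
    """Sum the bytes held by parameters, gradients, and optimizer state,
    grouped by dtype (diagnostic only)."""
    def add(stats, tensor):
        if torch.is_tensor(tensor):
            key = str(tensor.dtype)
            stats[key] = stats.get(key, 0) + tensor.numel() * tensor.element_size()

    param_bytes, grad_bytes, state_bytes = {}, {}, {}
    for p in model.parameters():
        add(param_bytes, p)
        if p.grad is not None:
            add(grad_bytes, p.grad)
    # Assumes the standard torch.optim layout: optimizer.state maps each
    # parameter to a dict of state tensors (momentum buffers, Adam moments, ...).
    for state in optimizer.state.values():
        for value in state.values():
            add(state_bytes, value)

    for name, stats in (("params", param_bytes),
                        ("grads", grad_bytes),
                        ("optimizer state", state_bytes)):
        total_mb = sum(stats.values()) / 2**20
        print(f"{name}: {total_mb:.1f} MB by dtype {stats}")

# After a few training steps:
# summarize_memory(accelerator.unwrap_model(model), optimizer)
# print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MB")

If the Muon groups' state buffers report float32 regardless of the mixed_precision setting, that would be consistent with the numbers above.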
