Hi, thanks for your great work on this project!
I encountered an issue when using Muon with mixed precision training via Hugging Face Accelerate.
Problem:
When I wrap SingleDeviceMuonWithAuxAdam with Accelerator using mixed_precision="bf16", GPU memory usage does not decrease compared to full precision training. In contrast, with the standard Adam optimizer, GPU consumption is reduced as expected under bf16 mixed precision.
Here’s a simplified version of my setup:
```python
import os

from accelerate import Accelerator
from muon import SingleDeviceMuonWithAuxAdam

accelerator = (
    Accelerator(
        log_with="tensorboard",
        mixed_precision="bf16",
        gradient_accumulation_steps=2,
        project_dir=os.environ.get("TRAIN_TF_EVENTS_PATH"),
    )
    if args.use_accelerate
    else None
)

# Split parameters into the hidden-layer group (optimized by Muon)
# and everything else (optimized by the auxiliary Adam)
hidden_params, nonhidden_params = model.get_hidden_nonhidden_params()
param_groups = [
    dict(params=hidden_params, use_muon=True, lr=args.lr, weight_decay=0.01),
    dict(params=nonhidden_params, use_muon=False, lr=args.lr, betas=(0.9, 0.98), weight_decay=0.01),
]
optimizer = SingleDeviceMuonWithAuxAdam(param_groups)

model, optimizer, scheduler, train_loader, valid_loader = accelerator.prepare(
    model, optimizer, scheduler, train_loader, valid_loader
)
```
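To help narrow this down, here is a small diagnostic sketch (the helper name is my own, not part of Muon or Accelerate) that prints the parameter dtypes and the total size of the optimizer state after a step. Since Accelerate's bf16 mixed precision is autocast-based, the master weights and optimizer state may well stay in float32; this just makes that visible.

```python
import torch

def summarize_optimizer_memory(model, optimizer):
    # dtypes of the model parameters after accelerator.prepare()
    # (with Accelerate's bf16 autocast the weights typically remain float32)
    print("parameter dtypes:", {p.dtype for p in model.parameters()})

    # total bytes held by optimizer state tensors
    # (e.g. Muon momentum buffers, Adam first/second moments)
    state_bytes = sum(
        v.numel() * v.element_size()
        for state in optimizer.state.values()
        for v in state.values()
        if torch.is_tensor(v)
    )
    print(f"optimizer state size: {state_bytes / 1024**2:.1f} MB")
```

Calling this after a training step (so the state buffers exist) should show whether the Muon state is kept in float32 regardless of the mixed_precision setting.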
Observation:
- With float32, Muon shows lower GPU consumption than Adam, as expected (2613 MB < 2803 MB).
- With bf16, Muon's GPU usage is higher than Adam's, and no memory saving is observed (2695 MB > 2679 MB).
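For reference, peak memory numbers like the ones above could be collected with something along these lines (the exact reset/measurement points are an assumption, not necessarily how the numbers above were obtained):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a few training steps here ...
peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"peak GPU memory: {peak_mb:.0f} MB")
```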