I've been experimenting with the Muon optimizer and noticed an interesting pattern that I'd like to seek your insights on.
In my tests:

- Dense models (10B–100B): Muon achieves excellent results, matching Adam's performance with roughly 50% fewer training tokens.
- MoE models (1B–246B): the improvements are much smaller, only 5–20% better than Adam, and sometimes just comparable.
Do you have any thoughts on why Muon's effectiveness differs so significantly between dense and MoE architectures?
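For context, here is a minimal sketch of the core Muon update as I understand it from the public reference implementation: per-2D-matrix Newton–Schulz orthogonalization of the momentum. The shape-dependent scale factor and hyperparameters below are illustrative assumptions rather than exactly what I ran; I include it because the per-matrix nature of the step seems relevant to how MoE expert weights (many smaller matrices) are treated versus one large dense projection.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map a matrix onto the nearest (semi-)orthogonal matrix
    via a quintic Newton-Schulz iteration (coefficients from the public
    Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.float32)
    X = X / (X.norm() + eps)          # ensure spectral norm <= 1 before iterating
    transpose = X.size(0) > X.size(1)  # iterate on the smaller Gram matrix
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transpose else X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon update for a single 2D weight matrix (in-place sketch).

    Momentum is accumulated as in SGD, then the whole matrix update is
    orthogonalized, so the effective step depends on each matrix's shape
    and spectrum rather than on per-element statistics as in Adam."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    # Shape-dependent scaling (one common choice; an assumption here). MoE
    # expert weights are many smaller matrices, so for the same total
    # parameter count they see different per-matrix scaling than a single
    # large dense projection.
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update, alpha=-lr * scale)
```

My rough intuition for asking: since the orthogonalized step is computed independently per weight matrix, sparsely activated expert matrices receive noisier, less frequent gradient signal per step, which might dilute whatever conditioning benefit Muon provides on the large, densely updated projections.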