I've been experimenting with the Muon optimizer and noticed an interesting pattern that I'd like to seek your insights on.
In my tests:

- Dense models (10B–100B): Muon achieves excellent results, matching Adam's performance with roughly 50% fewer training tokens.
- MoE models (1B–246B): the improvements are much smaller, only 5–20% better than Adam, and sometimes just comparable.
Do you have any thoughts on why Muon's effectiveness differs so significantly between dense and MoE architectures?
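For context, here is a minimal sketch of the core Muon update as I understand it from the public reference implementation: per-2D-matrix Newton–Schulz orthogonalization of the momentum. The shape-dependent scale factor and hyperparameters below are illustrative assumptions rather than exactly what I ran; I include it because the per-matrix nature of the step seems relevant to how MoE expert weights (many smaller matrices) are treated versus one large dense projection.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map a matrix onto the nearest (semi-)orthogonal matrix
    via a quintic Newton-Schulz iteration (coefficients from the public
    Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.to(torch.float32)
    X = X / (X.norm() + eps)          # ensure spectral norm <= 1 before iterating
    transpose = X.size(0) > X.size(1)  # iterate on the smaller Gram matrix
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transpose else X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon update for a single 2D weight matrix (in-place sketch).

    Momentum is accumulated as in SGD, then the whole matrix update is
    orthogonalized, so the effective step depends on each matrix's shape
    and spectrum rather than on per-element statistics as in Adam."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    # Shape-dependent scaling (one common choice; an assumption here). MoE
    # expert weights are many smaller matrices, so for the same total
    # parameter count they see different per-matrix scaling than a single
    # large dense projection.
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update, alpha=-lr * scale)
```

My rough intuition for asking: since the orthogonalized step is computed independently per weight matrix, sparsely activated expert matrices receive noisier, less frequent gradient signal per step, which might dilute whatever conditioning benefit Muon provides on the large, densely updated projections.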