Thanks for your work first of all!
You have posed this question in your write-up:
- Is it possible that Muon works only for pretraining, and won’t work for finetuning or reinforcement learning workloads?
Here are some anecdotal experiments using LoRA finetuning of Flux Dev1, an image generation model that is notoriously difficult to finetune on multiple concepts/tasks. The reasons for this are unknown, but are believed to be related to the fact that the model was not trained on raw data but distilled from a teacher model.
These are validation loss graphs:
The first one is the average; the others are the 4 different tasks being trained. Green is Muon, orange is AdamW. You can tell that while AdamW learns well for a while, it seems to hit a wall at some point where the tasks probably start competing against each other and the validation loss becomes unstable. Muon seems to deal with this much better.
In the last graph, the white line shows for reference a single-task training run on the same task using AdamW. It can easily reach the level Muon reaches, but only when there are no competing tasks.
[The white line is a bit unstable because with a single task there are only 20 training samples and it overfits quickly. For the same reason, the validation loss rises again after hitting a minimum in all graphs.]
The training runs above used batch size 4 on 4 tasks. Here are the same 4 tasks, but with batch size 1:
Muon clearly outperforms AdamW again because of the competing tasks.
Feel free to close this issue if this is not the right place, but I thought this might be an interesting anecdote for you, because it indicates that Muon is not only faster to reach the goal, but can also find better parameters for a difficult problem.
The AdamW hyperparameters are ones I have used and tuned many times, whereas this was my first time using Muon. So if there is any hyperparameter bias, it is probably in favor of AdamW.
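
In case it helps anyone reproduce a similar comparison, here is a minimal sketch of the kind of optimizer swap I mean: a simplified Muon (SGD momentum orthogonalized with a Newton-Schulz iteration) applied to the 2D LoRA matrices, with AdamW as the baseline. This is only an illustration; the class, the LoRA shapes, and the learning rates below are placeholders, not the exact code or hyperparameters of the runs above.

```python
# Simplified Muon sketch: SGD momentum whose update is orthogonalized with a
# Newton-Schulz iteration before being applied. Intended for 2D parameters
# such as LoRA A/B matrices; not the exact implementation used for these runs.
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T          # keep the matrix wide so A = X @ X.T stays small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

class Muon(torch.optim.Optimizer):
    def __init__(self, params, lr=0.02, momentum=0.95):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault("momentum_buffer", torch.zeros_like(p))
                buf.mul_(group["momentum"]).add_(p.grad)       # momentum accumulation
                p.add_(newton_schulz(buf), alpha=-group["lr"])  # orthogonalized update

# Toy stand-in for one adapted layer's LoRA matrices (shapes are illustrative).
lora_A = torch.nn.Parameter(0.01 * torch.randn(16, 3072))
lora_B = torch.nn.Parameter(torch.zeros(3072, 16))
opt = Muon([lora_A, lora_B], lr=0.02)
# Baseline run: opt = torch.optim.AdamW([lora_A, lora_B], lr=1e-4)
```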