Finetuning with Muon #25

@dxqb

Thanks for your work first of all!

You have posed this question in your write-up:

  • Is it possible that Muon works only for pretraining, and won’t work for finetuning or reinforcement learning workloads?

Here are some anecdotal experiments using LoRA finetuning of FLUX.1 [dev], an image generation model that is notoriously difficult to finetune on multiple concepts or tasks. The reasons for this are unknown, but it is believed to be related to the fact that this model was not trained on raw data but distilled from a teacher model.

These are the validation loss graphs:

[5 images: validation loss graphs]

The first one is the average, the other ones are the 4 different tasks that are trained. Green is Muon, orange is AdamW. AdamW learns well for a while but then seems to hit a wall, probably where the tasks start to compete against each other, and its validation loss becomes unstable. Muon seems to deal with this much better.

In the last graph, the white line shows for reference a single-task training of the same task using AdamW. AdamW can easily get where Muon goes, but only if there are no competing tasks.
[The white line is a bit unstable because with 1 task there are only 20 training samples and it overfits fast. For the same reason, the validation loss goes back up after a minimum in all graphs.]
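
For anyone who wants to compare runs the same way, here is a minimal sketch of how the averaged curve and the minimum before overfitting could be computed from per-task validation losses. The function names `average_curve` and `best_step` are hypothetical, and the numbers are placeholders for illustration, not measurements from the runs above.

```python
from statistics import fmean


def average_curve(per_task_losses):
    """Average several equally long validation-loss curves point by point."""
    curves = list(per_task_losses.values())
    if len({len(c) for c in curves}) != 1:
        raise ValueError("all curves must have the same number of evaluation points")
    return [fmean(step_losses) for step_losses in zip(*curves)]


def best_step(curve):
    """Index of the lowest validation loss; later points are likely overfit."""
    return min(range(len(curve)), key=curve.__getitem__)


# Placeholder numbers for illustration only, NOT results from this issue.
example = {
    "task_a": [0.50, 0.40, 0.35, 0.37],
    "task_b": [0.60, 0.45, 0.44, 0.50],
}
avg = average_curve(example)
print(best_step(avg), [round(v, 3) for v in avg])
```

Comparing optimizers at each run's own `best_step` rather than at the final step avoids penalizing a run only because it was trained past its minimum.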

The trainings above used batch size 4 on 4 tasks. Here are the same 4 tasks with batch size 1:

[Image: validation loss graph, batch size 1]
Muon clearly outperforms AdamW again because of the competing tasks.

Feel free to close this issue if this is not the right place, but I thought this might be an interesting anecdote for you, because it indicates that Muon is not only faster to the goal, but can find better parameters for a difficult problem.

I have used and tuned these AdamW hyperparameters many times, whereas this was my first time using Muon. So if there is any hyperparameter bias, it is probably in favor of AdamW.
