As you mentioned in the README, we should divide the model parameters into three parts:
hidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
nonhidden_params = [*model.head.parameters(), *model.embed.parameters()]
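If I understand the README correctly, these three groups are then passed to a combined optimizer roughly like this (a minimal sketch; the MuonWithAuxAdam wrapper and the hyperparameter values are what I took from the README example and may not fit every model):

from muon import MuonWithAuxAdam

# use_muon=True routes the 2D+ hidden weights to Muon; everything else
# falls back to the auxiliary AdamW-style update.
param_groups = [
    dict(params=hidden_weights, use_muon=True,
         lr=0.02, weight_decay=0.01),
    dict(params=hidden_gains_biases + nonhidden_params, use_muon=False,
         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),
]
optimizer = MuonWithAuxAdam(param_groups)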
However, I am still confused about whether to use Muon for the parameters of different model architectures:
- If my model is a vision transformer with an encoder, a decoder, and a DPT head for a depth prediction task, should I set the patchify layer as nonhidden_params? Should I set the entire DPT head as nonhidden_params? (See the sketch after this list.)
- If my model is a multi-modality LLM, should I optimize the projector (usually an MLP) with Muon during the alignment stage, or should it be treated as a head?
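For concreteness, here is the partition I am currently leaning towards for the vision-transformer case; the module names (patch_embed, encoder, decoder, dpt_head) are hypothetical placeholders for my model, and treating the patchify layer as embedding-like and the whole DPT head as an output head is just my guess:

# Hypothetical module names; treating patchify as an embedding and the
# entire DPT head as a head is my own reading, not a confirmed rule.
body_params = [*model.encoder.parameters(), *model.decoder.parameters()]
hidden_weights = [p for p in body_params if p.ndim >= 2]
hidden_gains_biases = [p for p in body_params if p.ndim < 2]
nonhidden_params = [*model.patch_embed.parameters(), *model.dpt_head.parameters()]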
Furthermore, could you offer a guideline on whether to use Muon for different kinds of parameters?