Releases · HomebrewML/HeavyBall
Fixed SOAP, HVP PSGD
Bugfixes:
- @francois-rozet fixed a severe convergence regression in SOAP. It's now faster and converges better than before (#42)
- ADOPT now correctly matches the paper, significantly improving its convergence
- FP64 storage and/or computation now works for more optimizers
Improvements:
- NewtonPSGD now supports exact HVP calculation instead of the previous approximation. (Handles BatchNorm better but doesn't support all architectures; see the sketch below.)
"smart_one_diag"
is a next-to-no-downsidesmemory_save_mode
for PSGD. It reduces memory and compute cost compared tomemory_save_mode=None
and improves convergence compared tomemory_save_mode="one_diag"
*
*Instead of preconditioning all dimensions (memory_save_mode=None
) or preconditioning all but the largest dimension (memory_save_mode="one_diag"
) we remove the largest dimension iff it's larger than the second largest. So, a Linear(128, 1024) will now create one 128x128 preconditioner (instead of 128x128 + 1024x1024, 8x as large as the parameters), while a Linear(128, 128) can still benefit from preconditioning both sides.
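For concreteness, here is a minimal sketch of the dimension-selection rule described above. The helper name `dims_to_precondition` is hypothetical and not part of heavyball's API; it only illustrates which sides of a weight tensor would get a full preconditioner under `"smart_one_diag"`.

```python
def dims_to_precondition(shape: tuple[int, ...]) -> list[int]:
    """Hypothetical helper: return the dimensions that would receive a
    full (non-diagonal) preconditioner under the "smart_one_diag" rule."""
    if len(shape) < 2:
        return list(range(len(shape)))
    largest, second = sorted(shape, reverse=True)[:2]
    if largest > second:
        # Drop the unique largest dimension; it falls back to a diagonal preconditioner.
        skip = shape.index(largest)
        return [i for i in range(len(shape)) if i != skip]
    return list(range(len(shape)))

print(dims_to_precondition((1024, 128)))  # [1] -> one 128x128 preconditioner for Linear(128, 1024)
print(dims_to_precondition((128, 128)))   # [0, 1] -> both sides still get preconditioned
```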
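Relatedly, here is a rough sketch of the difference between the exact Hessian-vector product NewtonPSGD can now use and the earlier finite-difference approximation, written with plain `torch.autograd`. heavyball's internal implementation may differ; `loss_fn`, `params`, and `vec` are placeholders.

```python
import torch

def hvp_exact(loss_fn, params, vec):
    """Exact Hessian-vector product via double backprop: d/dp <grad(loss), vec>."""
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params)

def hvp_finite_difference(loss_fn, params, vec, eps=1e-3):
    """Approximate HVP: (grad(p + eps * v) - grad(p)) / eps."""
    base = torch.autograd.grad(loss_fn(), params)
    with torch.no_grad():
        for p, v in zip(params, vec):
            p.add_(v, alpha=eps)
    shifted = torch.autograd.grad(loss_fn(), params)
    with torch.no_grad():
        for p, v in zip(params, vec):
            p.sub_(v, alpha=eps)
    return [(s - b) / eps for s, b in zip(shifted, base)]

model = torch.nn.Linear(4, 4)
x, y = torch.randn(8, 4), torch.randn(8, 4)
params = list(model.parameters())
vec = [torch.randn_like(p) for p in params]
loss_fn = lambda: torch.nn.functional.mse_loss(model(x), y)
print(hvp_exact(loss_fn, params, vec)[0].shape)
```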
OrthoGrad & PSGD improvements
- General
  - `precond_schedule` matches its docs (@francois-rozet, #31)
  - unified `warmup_steps` API (@francois-rozet, #32)
  - add `eps` arg to `scale_by_adam` (#33)
  - allow external management of LR (for `foreach=True` optimizers)
  - OrthoGrad, a "grokking-first" optimizer that works (a sketch of the idea follows below)
- PSGD
  - no more OOM in `torch.linalg.solve`
  - speed up cache by skipping it when it wouldn't give speedups
  - add newton-PSGD ("hvp-PSGD") using finite-difference approximation
  - caution momentum, not update (-> improved convergence; closer to the paper's intention; sketch below)
- Benchmarks
  - grokking benchmark, using modular addition and wide MLPs
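As a rough illustration of the OrthoGrad idea mentioned above: the component of the gradient parallel to the weights is removed before the update, and the result is rescaled to the original gradient norm. This is a sketch of one common formulation, not heavyball's exact code path; the function name and eps handling are assumptions.

```python
import torch

def orthogonal_gradient(weight: torch.Tensor, grad: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    """Sketch: project the gradient onto the subspace orthogonal to the weights,
    then rescale it back to the original gradient norm."""
    w, g = weight.flatten(), grad.flatten()
    g_orth = g - (w @ g) / (w @ w + eps) * w
    g_orth = g_orth * (g.norm() / (g_orth.norm() + eps))
    return g_orth.view_as(grad)
```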
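And a sketch of the "cautioning" idea: momentum components whose sign disagrees with the current gradient are zeroed and the survivors rescaled, following the Cautious Optimizers recipe. Applying this to the momentum buffer (rather than the final update) is what the bullet above refers to; the helper below is illustrative, not heavyball's implementation.

```python
import torch

def caution(momentum: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero momentum entries whose sign disagrees with the gradient and rescale
    the remaining entries so the overall magnitude is roughly preserved."""
    mask = (momentum * grad > 0).to(momentum.dtype)
    mask = mask * (mask.numel() / mask.sum().clamp(min=1))
    return momentum * mask
```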
Fix PSGD, spring cleaning
- Previously, only the first parameter of PSGD was trained; this is fixed now
- All PSGDs were `PurePSGD` - now `momentum_into_precond_update` and `exp_avg_input` have their expected effect again
- preliminary support for external changes of `group['lr']` (example below)
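A small usage sketch of changing `group['lr']` externally between steps. `torch.optim.SGD` stands in for a heavyball optimizer here; the pattern is the standard param-group mutation, which these releases make safe for heavyball's foreach optimizers as well.

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)  # stand-in for a heavyball optimizer

for step in range(100):
    # External LR management: mutate group['lr'] in place before each step,
    # e.g. a linear warmup over the first 10 steps.
    for group in opt.param_groups:
        group['lr'] = 1e-3 * min(1.0, (step + 1) / 10)
    loss = model(torch.randn(8, 4)).square().mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```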
v1.3.0: faster, less memory, minor fixes
- LaProp/Adam/... are now compilable
- `fused_hook` and `hook_optimizer_into_model`, reducing memory usage by fusing the backward pass with the optimizer step (see the sketch after this list)
- fewer inplace ops, giving better compilation and cleaner code
- scaling ("graft", "scale", "none") for Muon, allowing Adam#Muon at minimal cost
- the `storage_dtype` argument is implemented again
- LaProp is correctly implemented, ADOPT is more stable
- via @ethansmith2000: cleaner, more maintainable `defaults`, reducing the surface for potential errors
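The memory saving from `fused_hook` / `hook_optimizer_into_model` comes from stepping each parameter as soon as its gradient is ready, so gradients don't all have to be held until the end of backward. Below is a generic PyTorch sketch of that mechanism, not heavyball's actual helpers; it assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`.

```python
import torch

model = torch.nn.Linear(16, 16)

# One optimizer per parameter, stepped from a hook that fires right after that
# parameter's gradient has been accumulated during backward.
opt_per_param = {p: torch.optim.AdamW([p], lr=1e-3) for p in model.parameters()}

def step_when_grad_ready(param: torch.Tensor) -> None:
    opt_per_param[param].step()
    opt_per_param[param].zero_grad()

for p in model.parameters():
    p.register_post_accumulate_grad_hook(step_when_grad_ready)

loss = model(torch.randn(8, 16)).square().mean()
loss.backward()  # parameters are updated during backward; no separate optimizer.step()
```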
Stability, Muon and Fixes
- utils
  - bugfixes impacting SFAdamW and RMSProp
  - breaking: `zeroth_power_method` no longer supports `eigh` and doesn't allow specification of the number of newtonschulz iterations
  - faster `newtonschulz5` (via @tysam-code; see the sketch at the end of these notes)
  - PSGD preconditioner dampening (via @evanatyourservice)
- chainable
  - implementation of `nesterov_momentum`, `heavyball_momentum` and `orthogonalize_update`
- core
  - heavyball.Muon (by chaining `nesterov_momentum` and `orthogonalize_update`); Muon supports gradient and update clipping out of the box
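For reference, a sketch of the quintic Newton-Schulz iteration behind `newtonschulz5` / `orthogonalize_update`, which approximately replaces a matrix's singular values with 1 (the "zeroth power"). The coefficients follow the widely circulated Muon reference implementation; heavyball's tuned version may differ in dtype and details.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately map a 2D tensor to the nearest semi-orthogonal matrix using
    a quintic Newton-Schulz iteration (coefficients from the Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.float()
    transposed = x.size(0) > x.size(1)
    if transposed:  # iterate on the "fat" orientation so the Gram matrix is smaller
        x = x.T
    x = x / (x.norm() + eps)  # normalize so the spectral norm is at most 1
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    return x.T if transposed else x

update = newton_schulz_orthogonalize(torch.randn(1024, 128))
```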