Your study on learning rate is very helpful to me, but I still have some questions.
- In your learning rate scaling law experiments, do all training runs use the schedule mentioned before, with the learning rate reduced to 31.6% of its initial value at the 80% point of training and to 10% of its initial value at the 90% point?
- Does the learning rate used in the scaling law refer to the initial learning rate?
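
To make the first question concrete, here is a minimal sketch of the schedule as I understand it. The function name `step_lr`, its parameters, the reading of the percentages as fractions of total training steps, and the omission of any warmup phase are my own assumptions, not something taken from your code.

```python
def step_lr(step: int, total_steps: int, initial_lr: float) -> float:
    """Hypothetical multi-step schedule (my reading, not the authors' code).

    - first 80% of steps: initial_lr
    - from 80% to 90%:    31.6% of initial_lr
    - last 10% of steps:  10% of initial_lr
    Warmup is deliberately left out of this sketch.
    """
    # Integer comparisons avoid floating-point edge cases at the boundaries.
    if 10 * step < 8 * total_steps:
        return initial_lr
    if 10 * step < 9 * total_steps:
        return initial_lr * 0.316
    return initial_lr * 0.1
```

If this matches your setup, then my second question is whether `initial_lr` here is the quantity that the scaling law predicts.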
Looking forward to your reply.