-
Thanks @lapp0! The more insights the better! I would appreciate any comments you have.
Very interesting! I will fall back to a more traditional qkv.
-
I did enforce this with the new data, so it shouldn't be a problem. I also allowed any sequence length to take advantage of flex attention.
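Roughly, the idea is something like the following sketch of a document-aware flex attention mask that works for arbitrary sequence lengths (requires a recent PyTorch; the `docs` tensor of per-token document ids and the helper names are illustrative, not the actual code):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def document_block_mask(docs: torch.Tensor, n_heads: int):
    """Only let tokens attend within their own document; any sequence length works
    because the mask is rebuilt per batch."""
    B, S = docs.shape

    def mask_mod(b, h, q_idx, kv_idx):
        return docs[b, q_idx] == docs[b, kv_idx]

    return create_block_mask(mask_mod, B, n_heads, S, S, device=docs.device)

def attend(q, k, v, docs):
    # q, k, v: (B, n_heads, S, head_dim)
    block_mask = document_block_mask(docs, q.shape[1])
    return flex_attention(q, k, v, block_mask=block_mask)
```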
Are you sure? I thought in this implementation…
I have not seen this warning. The longest single document in OMGprot50 should be less than 50,000 tokens, so if fine-web has longer documents than that, your hypothesis would make sense.
I actually recently converged upon 3/2 with some experimentation, which differs a bit from common practice. Neat!
Really great point. It turns out ESM2 was trained this way (able to select from any tokens), but intuitively it seems worse. I will go back to the old behavior. Also, did you get rid of value embeddings? I didn't see them when skimming your code yesterday. Really curious whether you found they were hampering performance at all.
-
I've been making changes to ESM2 for https://github.com/lapp0/kbert and thought I'd share some insights. If these are helpful I can continue doing this in the future.
qkv
You use

```python
self.qkv = CastedLinear(dim, 3 * dim)
```
This will result in Muon orthogonalizing the Q, K, and V projections together. I've experimented with this in the past for modded-nanogpt: it gives faster steps but less learning per step, netting less learning per minute.
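Splitting it looks roughly like this, so Muon sees three separate matrices (a sketch using plain `nn.Linear` in place of `CastedLinear`):

```python
import torch
import torch.nn as nn

class SeparateQKV(nn.Module):
    """Separate Q/K/V projections so Muon orthogonalizes each (dim, dim) matrix
    independently instead of one fused (3*dim, dim) weight."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.q(x), self.k(x), self.v(x)
```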
10-step warmup
I'm not sure why steps 2-10 are skipped, but during the initial eval/train steps torch is compiling (with and without grad), loading the compile cache, etc. I don't think it's important to benchmark these steps.
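One way to keep those steps out of the benchmark is to start the clock only after a fixed warmup, something like this sketch (`run_step` and the warmup constant are placeholders):

```python
import time

WARMUP_STEPS = 10  # first steps include torch.compile and cache loading

def timed_training(steps, run_step):
    """Run all steps, but only measure wall-clock time after the warmup."""
    t0 = None
    for step in range(steps):
        if step == WARMUP_STEPS:
            t0 = time.perf_counter()  # start timing once compilation has settled
        run_step(step)
    return time.perf_counter() - t0 if t0 is not None else 0.0
```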
LERPing replace/keep ratio
My 2.1906 run relies on LERPing a separate, extreme `replace_keep` probability, not a shared MLM prob. It's quite aggressive as well (15% -> 3%). Consider integrating this into your next run.
See commit 6b8b990f906da47538e368541eea1ce5a902b1be.
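The schedule itself is just a linear interpolation over training, roughly like this sketch (function name and exact schedule shape are illustrative; see the commit above for the real thing):

```python
def replace_keep_prob(step: int, total_steps: int,
                      start: float = 0.15, end: float = 0.03) -> float:
    """LERP an aggressive replace/keep probability from 15% down to 3% over training,
    independently of the shared MLM mask probability."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac
```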
Padded Dataloader Bugs
I've run into a handful of bugs with my padded dataloader implementation; passing these along in case you run into any of them:

- […] the `.bin` files, otherwise `train_loader.next_batch()` will be an empty tensor.
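A cheap guard against the empty-tensor case is to assert on the batch right after loading, e.g. (the wrapper name is made up; `next_batch()` is the call from the snippet above):

```python
def next_nonempty_batch(loader):
    """Fail fast if the padded dataloader returns an empty batch, which usually
    means something is wrong with the .bin files feeding it."""
    batch = loader.next_batch()
    assert batch.numel() > 0, "next_batch() returned an empty tensor; check your .bin files"
    return batch
```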