-
Thanks @lapp0! The more insights the better! I would appreciate any comments you have.
Very interesting! I will fall back to a more traditional qkv.
-
I did enforce this with the new data, so it shouldn't be a problem. I also allowed any sequence length to take advantage of flex attention.
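Roughly, the idea is something like the following sketch of a document-aware flex attention mask that works for arbitrary sequence lengths (requires a recent PyTorch; the `docs` tensor of per-token document ids and the helper names are illustrative, not the actual code):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

def document_block_mask(docs: torch.Tensor, n_heads: int):
    """Only let tokens attend within their own document; any sequence length works
    because the mask is rebuilt per batch."""
    B, S = docs.shape

    def mask_mod(b, h, q_idx, kv_idx):
        return docs[b, q_idx] == docs[b, kv_idx]

    return create_block_mask(mask_mod, B, n_heads, S, S, device=docs.device)

def attend(q, k, v, docs):
    # q, k, v: (B, n_heads, S, head_dim)
    block_mask = document_block_mask(docs, q.shape[1])
    return flex_attention(q, k, v, block_mask=block_mask)
```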
Are you sure? I thought in this implementation…
I have not seen this warning. The longest single document in OMGprot50 should be less than 50,000 tokens, so if fine-web has longer documents than that, your hypothesis would make sense.
I actually recently converged upon 3/2 with some experimentation, which differs a bit from common practice. Neat!
Really great point. It turns out ESM2 was trained this way (able to select from any tokens), but intuitively it seems worse. I will go back to the old behavior. Also, did you get rid of value embeddings? I didn't see them when skimming your code yesterday. Really curious whether you found they were hampering performance at all.
-
I've been making changes to ESM2 for https://github.com/lapp0/kbert and thought I'd share some insights. If these are helpful I can continue doing this in the future.
qkv
You use

```python
self.qkv = CastedLinear(dim, 3 * dim)
```
This will result in Muon orthogonalizing the Q, K, and V projections together. I've experimented with this in the past for modded-nanogpt: it gives faster steps but less learning per step, netting less learning per minute.
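Splitting it looks roughly like this, so Muon sees three separate matrices (a sketch using plain `nn.Linear` in place of `CastedLinear`):

```python
import torch
import torch.nn as nn

class SeparateQKV(nn.Module):
    """Separate Q/K/V projections so Muon orthogonalizes each (dim, dim) matrix
    independently instead of one fused (3*dim, dim) weight."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor):
        return self.q(x), self.k(x), self.v(x)
```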
10-step warmup
I'm not sure why steps 2-10 are skipped, but during the initial eval/train steps torch is compiling (with and without grad), loading the compile cache, etc. I don't think it's important to benchmark these steps.
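One way to keep those steps out of the benchmark is to start the clock only after a fixed warmup, something like this sketch (`run_step` and the warmup constant are placeholders):

```python
import time

WARMUP_STEPS = 10  # first steps include torch.compile and cache loading

def timed_training(steps, run_step):
    """Run all steps, but only measure wall-clock time after the warmup."""
    t0 = None
    for step in range(steps):
        if step == WARMUP_STEPS:
            t0 = time.perf_counter()  # start timing once compilation has settled
        run_step(step)
    return time.perf_counter() - t0 if t0 is not None else 0.0
```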
LERPing replace/keep ratio
My 2.1906 run relies on LERPing a separate, extreme `replace_keep` probability, not a shared MLM prob. It's quite aggressive as well (15% -> 3%). Consider integrating this into your next run.
See commit 6b8b990f906da47538e368541eea1ce5a902b1be.
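The schedule itself is just a linear interpolation over training, roughly like this sketch (function name and exact schedule shape are illustrative; see the commit above for the real thing):

```python
def replace_keep_prob(step: int, total_steps: int,
                      start: float = 0.15, end: float = 0.03) -> float:
    """LERP an aggressive replace/keep probability from 15% down to 3% over training,
    independently of the shared MLM mask probability."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac
```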
Padded Dataloader Bugs
I've run into a handful of bugs with my padded dataloader implementation; passing these along in case you run into any of them:

- […] the `.bin` files, otherwise `train_loader.next_batch()` will be an empty tensor.
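A cheap guard against the empty-tensor case is to assert on the batch right after loading, e.g. (the wrapper name is made up; `next_batch()` is the call from the snippet above):

```python
def next_nonempty_batch(loader):
    """Fail fast if the padded dataloader returns an empty batch, which usually
    means something is wrong with the .bin files feeding it."""
    batch = loader.next_batch()
    assert batch.numel() > 0, "next_batch() returned an empty tensor; check your .bin files"
    return batch
```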