We were recently testing the HuggingFace implementation of ESMC-300 for protein property prediction (classification/regression). We observed two issues with protein embedding quality and would be happy to hear your opinion.
(PS: I assume the ESM implementation we used mimics your implementation, but I opened the discussion here as the community seems more active.)
1) Attention bias in cls/start token
The default protein representation in ESM is a concatenation of the `<cls>` token embedding (start of sequence) and the mean-pooled encodings of the individual amino acids.
However, we observed that (as expected) the `<cls>` token attends much more strongly to amino acids at the start of the sequence.
This is an issue because the sequence representation becomes much more sensitive to changes at the start of the sequence.
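For concreteness, here is a minimal sketch of this pooling (PyTorch; `last_hidden` stands in for the `(seq_len, d)` hidden-state tensor of the encoder, with `<cls>` at position 0 and `<eos>` at the end — names are placeholders, not the actual API):

```python
import torch

def default_representation(last_hidden: torch.Tensor) -> torch.Tensor:
    """Concatenate the <cls> embedding with mean-pooled residue embeddings.

    last_hidden: (seq_len, d) hidden states for <cls> aa_1 ... aa_L <eos>.
    Returns a (2 * d,) sequence representation.
    """
    cls_emb = last_hidden[0]             # <cls> token embedding
    residue_emb = last_hidden[1:-1]      # drop <cls> and <eos>
    mean_emb = residue_emb.mean(dim=0)   # mean-pool over amino acids
    return torch.cat([cls_emb, mean_emb], dim=-1)
```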
We also observed this bias in subsequent property prediction. We mutated individual positions to all possible amino acids and predicted property scores for the resulting sequences. The distribution of property scores was much more volatile at the start of the sequences than in the following regions (across the different properties we tested). The image below shows property scores (y-axis) for sequences with individual mutations introduced along the protein length (x-axis) across all amino acids (min/max/mean are shown as differently colored dots).
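The scan itself was a straightforward saturation mutagenesis; a sketch of what we did, with `embed` and `predict` as placeholders for the pooling above and the trained property head:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutational_scan(seq: str, embed, predict) -> dict[int, list[float]]:
    """Score every single-position substitution along the protein."""
    scores: dict[int, list[float]] = {}
    for i, wt in enumerate(seq):
        pos_scores = []
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            mutant = seq[:i] + aa + seq[i + 1:]
            pos_scores.append(float(predict(embed(mutant))))
        scores[i] = pos_scores  # the min/max/mean of this list are plotted per position
    return scores
```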
As a first attempt to mitigate this we tried removing the `<cls>` token from the protein representation for property prediction. However, this substantially reduced property prediction performance, yielding training dynamics similar to those observed when using concatenated protein pairs (see below).
If anyone has an idea for a highly informative sequence representation without sequence-position bias, we would really appreciate it.
2) Embedding protein pairs
The ESM implementation we used embeds protein pairs (e.g., for protein interaction prediction) via: `encoder(<cls>seq1<eos>seq2<eos>)`
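Roughly, a sketch of this concatenated encoding (assuming a transformers-style `tokenizer`/`encoder` with `<cls>`/`<eos>` special tokens; the names and the exact tokenization of the joint input are assumptions, not necessarily the actual API):

```python
import torch

def embed_pair_concatenated(seq1: str, seq2: str, tokenizer, encoder) -> torch.Tensor:
    """Encode '<cls> seq1 <eos> seq2 <eos>' as a single input and pool it."""
    ids1 = tokenizer(seq1, return_tensors="pt").input_ids  # assumed: <cls> seq1 <eos>
    ids2 = tokenizer(seq2, add_special_tokens=False, return_tensors="pt").input_ids  # seq2 only
    eos = torch.tensor([[tokenizer.eos_token_id]])
    input_ids = torch.cat([ids1, ids2, eos], dim=1)  # <cls> seq1 <eos> seq2 <eos>
    hidden = encoder(input_ids=input_ids).last_hidden_state[0]  # (seq_len, d)
    # same <cls> + mean pooling as in the sketch above
    return torch.cat([hidden[0], hidden[1:-1].mean(dim=0)], dim=-1)
```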
However, we observed that such embeddings performed poorly on the downstream task of predicting protein-pair properties (see below). Instead, encoding both proteins individually and taking the difference of the two embeddings as features/representations for property prediction performed much better. We have also seen this strategy in a few other PLM papers from the last year: `encoder(<cls>seq1<eos>) - encoder(<cls>seq2<eos>)`
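For comparison, the difference-based representation is simply (with `embed` again a placeholder for whatever per-protein pooling is used, e.g. the `<cls>` + mean pooling above):

```python
import torch

def embed_pair_difference(seq1: str, seq2: str, embed) -> torch.Tensor:
    """Encode each protein separately and use the element-wise difference
    of the two representations as features for the property head."""
    return embed(seq1) - embed(seq2)  # note: order-dependent (seq1 vs. seq2)
```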
In particular, when using the difference-based representation:

- Training loss gets lower.
- Validation losses show clearer dynamics (in our case we overfit with respect to 3/5 validation splits, while in 2/5 we may see some generalization, as the loss actually decreases during training).
- Validation loss on real rather than randomly perturbed sequences is much smaller. As a "negative control" we randomly shuffled the individual amino acids within validation sequences (before concatenation, for the concatenated approach); a sketch of this shuffling is shown below. Losses on shuffled sequences are shown as dotted lines.
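The shuffling control is just a per-sequence permutation, e.g.:

```python
import random

def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Negative control: randomly permute the amino acids within a sequence."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)
```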