We were recently testing the HuggingFace implementation of ESMC-300 for protein property prediction (classification/regression). We observed two issues with protein embedding quality and would be happy to hear your opinion.
(PS: I assume the ESM implementation we used mimics your implementation, but I opened the discussion here as the community seems more active.)
1) Attention bias in cls/start token
The default protein representation in ESM is a concatenation of the `<cls>` token embedding (start of sequence) and the mean-pooled encodings of the individual amino acids.
However, we observed that (as expected) the `<cls>` token attends much more strongly to amino acids at the start of the sequence.
This is an issue because the sequence representation becomes much more sensitive to changes at the start of the sequence.
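For concreteness, here is a minimal sketch of this pooling (PyTorch; `last_hidden` stands in for the `(seq_len, d)` hidden-state tensor of the encoder, with `<cls>` at position 0 and `<eos>` at the end — names are placeholders, not the actual API):

```python
import torch

def default_representation(last_hidden: torch.Tensor) -> torch.Tensor:
    """Concatenate the <cls> embedding with mean-pooled residue embeddings.

    last_hidden: (seq_len, d) hidden states for <cls> aa_1 ... aa_L <eos>.
    Returns a (2 * d,) sequence representation.
    """
    cls_emb = last_hidden[0]             # <cls> token embedding
    residue_emb = last_hidden[1:-1]      # drop <cls> and <eos>
    mean_emb = residue_emb.mean(dim=0)   # mean-pool over amino acids
    return torch.cat([cls_emb, mean_emb], dim=-1)
```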
We also observed this bias in subsequent property prediction. We mutated individual positions to all possible amino acids and predicted property scores for the resulting sequences. The distribution of property scores was much more volatile at the start of the sequences than in the following regions (across the different properties we tested). The image below shows property scores (y-axis) for sequences with individual mutations introduced along the protein length (x-axis) across all amino acids (min/max/mean are shown as differently colored dots).
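The scan itself was a straightforward saturation mutagenesis; a sketch of what we did, with `embed` and `predict` as placeholders for the pooling above and the trained property head:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutational_scan(seq: str, embed, predict) -> dict[int, list[float]]:
    """Score every single-position substitution along the protein."""
    scores: dict[int, list[float]] = {}
    for i, wt in enumerate(seq):
        pos_scores = []
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            mutant = seq[:i] + aa + seq[i + 1:]
            pos_scores.append(float(predict(embed(mutant))))
        scores[i] = pos_scores  # the min/max/mean of this list are plotted per position
    return scores
```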
As a first attempt to mitigate this we tried removing the `<cls>` token from the protein representation for property prediction. However, this substantially reduced property prediction performance, yielding training dynamics similar to those observed when using concatenated protein pairs (see below).
If anyone has an idea for a highly informative sequence representation without sequence-position bias, we would really appreciate it.
2) Embedding protein pairs
The ESM implementation we used embeds protein pairs (e.g., for protein interaction prediction) via: `encoder(<cls>seq1<eos>seq2<eos>)`
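Roughly, a sketch of this concatenated encoding (assuming a transformers-style `tokenizer`/`encoder` with `<cls>`/`<eos>` special tokens; the names and the exact tokenization of the joint input are assumptions, not necessarily the actual API):

```python
import torch

def embed_pair_concatenated(seq1: str, seq2: str, tokenizer, encoder) -> torch.Tensor:
    """Encode '<cls> seq1 <eos> seq2 <eos>' as a single input and pool it."""
    ids1 = tokenizer(seq1, return_tensors="pt").input_ids  # assumed: <cls> seq1 <eos>
    ids2 = tokenizer(seq2, add_special_tokens=False, return_tensors="pt").input_ids  # seq2 only
    eos = torch.tensor([[tokenizer.eos_token_id]])
    input_ids = torch.cat([ids1, ids2, eos], dim=1)  # <cls> seq1 <eos> seq2 <eos>
    hidden = encoder(input_ids=input_ids).last_hidden_state[0]  # (seq_len, d)
    # same <cls> + mean pooling as in the sketch above
    return torch.cat([hidden[0], hidden[1:-1].mean(dim=0)], dim=-1)
```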
However, we observed that such embeddings performed poorly on the downstream task of predicting protein-pair properties (see below). Instead, encoding both proteins individually and taking the difference of the two embeddings as features/representations for property prediction performed much better. We have also seen this strategy in a few other PLM papers from the last year: `encoder(<cls>seq1<eos>) - encoder(<cls>seq2<eos>)`
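For comparison, the difference-based representation is simply (with `embed` again a placeholder for whatever per-protein pooling is used, e.g. the `<cls>` + mean pooling above):

```python
import torch

def embed_pair_difference(seq1: str, seq2: str, embed) -> torch.Tensor:
    """Encode each protein separately and use the element-wise difference
    of the two representations as features for the property head."""
    return embed(seq1) - embed(seq2)  # note: order-dependent (seq1 vs. seq2)
```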
In particular, when using the difference-based representation:

- Training loss gets lower.
- Validation losses show clearer dynamics (in our case we overfit with respect to 3/5 validation splits, while in 2/5 we may see some generalization, as the loss actually decreases during training).
- Validation loss on real rather than randomly perturbed sequences is much smaller. As a "negative control" we randomly shuffled the individual amino acids within validation sequences (before concatenation, for the concatenated approach); a sketch of this shuffling is shown below. Losses on shuffled sequences are shown as dotted lines.
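The shuffling control is just a per-sequence permutation, e.g.:

```python
import random

def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Negative control: randomly permute the amino acids within a sequence."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)
```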