When using ESMC to generate embeddings, two extra tokens are added to the sequence. If the input sequence has length `seq_len`, then the shape of the embedding produced by ESMC is:
(1, seq_len + 2, 960) for esmc-300m-2024-12, and
(1, seq_len + 2, 1152) for esmc-600m-2024-12.
Despite my attempts, I couldn't find an explanation for these two extra tokens in the official documentation.
My current workaround for per-residue embeddings is to drop the first and last token embeddings, so that the remaining length matches the number of residues. This gives one embedding per residue, which I use as node features when building a protein graph.
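For reference, here is a minimal sketch of that workaround. It uses a random NumPy array as a stand-in for the actual ESMC output tensor (the real model call is omitted), assuming the 300m model's hidden size of 960:

```python
import numpy as np

# Hypothetical stand-in for the tensor returned by esmc-300m-2024-12
# for a 10-residue sequence: shape (1, seq_len + 2, 960), where the
# two extra positions are the added special tokens at each end.
seq_len = 10
embeddings = np.random.rand(1, seq_len + 2, 960)

# Drop the first and last token embeddings so the length matches
# the number of residues.
per_residue = embeddings[:, 1:-1, :]

print(per_residue.shape)  # (1, 10, 960)
```

After slicing, `per_residue[0, i]` is the embedding for residue `i`, ready to use as a graph node feature.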
Is there an official explanation for these two extra token embeddings?