Why does the embedding generated by ESMC have two more tokens than the sequence length?

When I use ESMC to generate embeddings, the output contains two more token positions than the input has residues. If the input sequence has length “seq_len”, the shape of the embedding generated by ESMC is
(1, seq_len + 2, 960) for esmc-300m-2024-12, and
(1, seq_len + 2, 1152) for esmc-600m-2024-12.
I could not find an explanation for these two extra tokens in the official documentation.

My current workaround for per-residue embeddings is to drop the first and last token embeddings so that the length matches the number of residues (the sequence length). This gives me one embedding per residue, which I use as node features in a protein graph.
Is there an official explanation for these two extra embeddings?
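
For concreteness, here is a minimal sketch of the trimming described above. It assumes the two extra positions are the first and the last one (for example start and end special tokens added by the tokenizer), which is exactly the point this question asks to have confirmed. The helper name `per_residue_embeddings` and the stand-in data are hypothetical and not part of the ESM library; the function only relies on indexing and slicing, so it should also accept a NumPy array or a torch tensor of shape (1, seq_len + 2, dim).

```python
def per_residue_embeddings(embeddings, seq_len):
    """Drop the first and last token embeddings of a single-sequence batch.

    `embeddings` is anything indexable with shape (1, seq_len + 2, dim),
    e.g. nested lists, a NumPy array, or a torch tensor.
    """
    tokens = embeddings[0]  # remove the batch dimension -> (seq_len + 2, dim)
    if len(tokens) != seq_len + 2:
        raise ValueError(
            f"expected {seq_len + 2} token embeddings, got {len(tokens)}"
        )
    # Assumption: position 0 and position -1 are the two extra tokens.
    return tokens[1:-1]  # -> (seq_len, dim)


# Stand-in data instead of a real model call: 5 residues, 4-dimensional
# embeddings (real ESMC output is 960 or 1152 wide). Row i is filled with i.
sequence = "MKTAY"
dim = 4
fake_embeddings = [[[float(i)] * dim for i in range(len(sequence) + 2)]]

residue_embeddings = per_residue_embeddings(fake_embeddings, len(sequence))
print(len(residue_embeddings), len(residue_embeddings[0]))
```

The length check is there so that a sequence that was padded or truncated somewhere upstream fails loudly instead of silently misaligning residues and graph nodes.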

Why does the embedding generated by ESMC have two more tokens than the sequence length? #241
