Longformer, Big Bird implementation #5
Replies: 3 comments
-
Big Bird and Longformer both designate "global" tokens that attend to all other tokens (and get attended to in turn). It was unclear to me how these were chosen, and in my coding efforts I have assumed that specific individual tokens are given global status, selected in advance by the user. However, I have wondered whether global token selection could be learned by the model itself. Something to consider in future implementation efforts.
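For concreteness, here is a minimal sketch of the user-specified variant (assumed names and shapes, not attention_smithy's API): given positions chosen in advance, build a mask in which those positions attend everywhere and everything attends to them.

```python
# Minimal sketch (assumed names, not attention_smithy's API): user-selected
# global tokens expressed as a boolean attention mask. In the real models this
# global component is combined with local windowed (and, for Big Bird, random)
# attention; only the global part is shown here.
import torch

def global_attention_mask(seq_len: int, global_indices: list[int]) -> torch.Tensor:
    """Return a (seq_len, seq_len) bool mask where True = attention allowed."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    idx = torch.tensor(global_indices)
    mask[idx, :] = True  # global tokens attend to every position
    mask[:, idx] = True  # every position attends to global tokens
    return mask

# Example: an 8-token sequence where position 0 (e.g. a [CLS] token) is global.
print(global_attention_mask(8, [0]).int())
```

A learned variant could, for instance, score positions with a small head and take the top-k per example as global, though that is speculation beyond what either paper prescribes.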
-
I would also be interested to see whether these implementations could be mixed with Hyena operators, with the two approaches possibly reinforcing each other.
-
Longformer (for encoders, so far) was implemented in #44.
-
I actually started a Big Bird implementation already (see `src/attention_smithy/attention/BigBirdAttention.py`). It has a full test suite and technically performs as it should. BUT I realized during real-life application that my variation of their approach, direct indexing rather than using the `gather` function, does not play well with large data samples. Thus, the entire point of the implementation was rendered moot.

I'd like to rewrite this at some point to use the `gather` function, but that would require extensive rewrites. If anyone has any thoughts, let me know.

Longformer employs similar principles, so co-development would probably be a good idea.
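To make the indexing question concrete, here is a hedged sketch contrasting the two styles on a block-sparse gather; the toy shapes and names are assumptions for illustration, not the code in `BigBirdAttention.py`:

```python
# Hedged sketch (toy shapes, assumed names; not the repo's actual code):
# gathering key blocks per query block, once with direct (advanced) indexing
# and once with torch.gather.
import torch

batch, num_blocks, block_size, head_dim = 2, 16, 64, 32
blocks_per_query = 3  # e.g. Big Bird's random blocks per query block

key = torch.randn(batch, num_blocks, block_size, head_dim)
# Per-example block indices, shape (batch, num_blocks, blocks_per_query).
block_ids = torch.randint(0, num_blocks, (batch, num_blocks, blocks_per_query))

# Style 1: direct indexing. Needs an explicit batch index that broadcasts
# against block_ids.
batch_idx = torch.arange(batch)[:, None, None]  # (batch, 1, 1)
gathered_direct = key[batch_idx, block_ids]
# -> (batch, num_blocks, blocks_per_query, block_size, head_dim)

# Style 2: torch.gather. The index tensor is expanded to the output shape,
# and the input is expanded (a view, no copy) to match its rank.
index = block_ids[..., None, None].expand(-1, -1, -1, block_size, head_dim)
gathered = torch.gather(
    key[:, None].expand(-1, num_blocks, -1, -1, -1),  # view over key
    dim=2,
    index=index,
)

assert torch.equal(gathered, gathered_direct)
```

Both styles produce the same tensor here; the practical difference at scale comes down to how each composes with batched, per-example indices and backend memory behavior, which is why reference block-sparse implementations typically lean on `torch.gather`.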