I have a question about position_embedding #3
Replies: 1 comment
-
The authors of the Vision Transformer paper [1] compared the performance of a 1-D learnable embedding against a 2-D encoding method and found no significant difference (Section 3.1). I have tested different positional encodings myself and found that learnable and sinusoidal (original) positional encodings give similar results, whereas more recent, advanced methods can achieve superior results. You can find the implementation and results at https://github.com/s-chh/2D-Positional-Encoding-Vision-Transformer.

[1] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
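For reference, here is a minimal sketch (not the linked repository's code, names and shapes are illustrative) of one common fixed 2-D alternative: a sinusoidal table where half of the embedding encodes the patch row and the other half the patch column.

```python
# Minimal sketch of a fixed 2-D sinusoidal positional encoding for a
# ViT-style patch grid. Assumes `grid_size` patches per side and an
# embedding dimension `dim` divisible by 4.
import torch

def sinusoidal_2d_pos_encoding(grid_size: int, dim: int) -> torch.Tensor:
    """Return a (grid_size*grid_size, dim) table: the first half of each
    vector encodes the row index, the second half the column index."""
    assert dim % 4 == 0, "dim must split evenly into row/col sin/cos parts"
    half = dim // 2
    # Frequencies follow the original Transformer recipe: 10000^(-2i/half)
    freqs = torch.exp(-torch.arange(0, half, 2).float() / half
                      * torch.log(torch.tensor(10000.0)))
    pos = torch.arange(grid_size).float()
    angles = pos[:, None] * freqs[None, :]                     # (grid_size, half/2)
    table_1d = torch.cat([angles.sin(), angles.cos()], dim=1)  # (grid_size, half)

    rows = table_1d[:, None, :].expand(grid_size, grid_size, half)  # varies by row
    cols = table_1d[None, :, :].expand(grid_size, grid_size, half)  # varies by column
    return torch.cat([rows, cols], dim=-1).reshape(grid_size * grid_size, dim)

# Example: 14x14 patches (224px image, 16px patches) with dim=768
pe = sinusoidal_2d_pos_encoding(14, 768)
print(pe.shape)  # torch.Size([196, 768])
```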
-
Why does the model add a zero-initialized tensor that is then learned during training at the position-embedding stage, rather than directly adding tensors that are already well encoded and would not need any further learning?
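For context, this is a minimal sketch of the two options being contrasted (hypothetical names, not tied to any particular repository): a zero-initialized learnable parameter that the optimizer updates, versus a fixed, pre-computed table registered as a buffer.

```python
# Sketch only: learnable vs. fixed positional embedding in PyTorch.
import torch
import torch.nn as nn

class PatchPosEmbed(nn.Module):
    def __init__(self, num_tokens: int, dim: int, learnable: bool = True):
        super().__init__()
        if learnable:
            # Starts at zero, but it is an nn.Parameter, so gradients flow
            # into it and the model learns the positional information itself.
            self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        else:
            # A fixed table (e.g. a precomputed sinusoidal encoding) stored
            # as a buffer: it is added to the tokens but never updated.
            table = torch.randn(1, num_tokens, dim)  # placeholder for a real table
            self.register_buffer("pos_embed", table)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim) patch embeddings
        return x + self.pos_embed
```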