Position Encoding Extension Brainstorm #6
3 comments · 2 replies
-
Relative positional encoding is one idea. I suspect it would require implementing an AttentionMethod class rather than what's been done previously for position encoding in the package.
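To make the idea concrete, here's a minimal sketch of why a relative scheme has to hook into the attention score computation itself rather than being added to token embeddings up front. The class name and interface below are purely illustrative (not AttentionSmithy's actual AttentionMethod API): a learned scalar bias per clipped relative offset is added to the logits before softmax.

```python
# Hypothetical sketch only: a learned scalar bias per (head, clipped relative offset),
# added to the attention logits before softmax. Because the bias depends on query/key
# *pairs*, it has to live inside the attention computation, not the embedding layer.
import torch
import torch.nn as nn

class RelativeBiasAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        self.max_distance = max_distance
        # one learned bias per (clipped relative offset, head)
        self.rel_bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (b, h, n, n)

        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_distance, self.max_distance)
        bias = self.rel_bias(rel + self.max_distance)             # (n, n, h)
        scores = scores + bias.permute(2, 0, 1)                   # broadcast over batch

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)
```

Whatever the real interface ends up looking like, the key requirement is access to the per-head query-key score matrix.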
-
I know various techniques have been developed to account for and extend the context window length for rotary attention. The terms position interpolation (see here) and adjusted base frequency (see here) have come up in my reading (taking terms from here). It's worth looking into further to see how they could be implemented in AttentionSmithy.
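For reference, both tricks boil down to changing how the rotary angles are computed. The sketch below is illustrative only (the function name and parameters are made up, not from AttentionSmithy or any paper's released code): position interpolation divides the position indices by a scale factor, while the adjusted-base approach multiplies the rotary base. Published variants typically derive these factors from the ratio of target to trained context length.

```python
# Illustrative sketch: how the two context-extension tricks change RoPE's angle table.
# The resulting angles feed the usual cos/sin rotation of queries and keys.
import torch

def rope_angles(seq_len, head_dim, base=10000.0,
                position_scale=1.0,   # position interpolation: scale > 1 compresses indices
                base_scale=1.0):      # adjusted base frequency: scale > 1 raises the base
    # inverse frequencies, one per pair of embedding dimensions
    inv_freq = 1.0 / ((base * base_scale) ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # position interpolation squeezes unseen large positions back into the trained range
    positions = torch.arange(seq_len).float() / position_scale
    return torch.outer(positions, inv_freq)   # (seq_len, head_dim // 2) angles

# e.g. a model trained on 2k tokens, run at 8k tokens, might use either:
# angles = rope_angles(8192, 64, position_scale=4.0)   # position interpolation
# angles = rope_angles(8192, 64, base_scale=8.0)       # adjusted base frequency
```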
-
I thought I'd experiment with ChatGPT's new deep research feature for this topic. Here's a report it spit out. The references did not carry over, but for idea generation it was a pretty good report.

Additional Positional Encoding Strategies for Transformers

Transformer models require positional encodings to introduce order information, since self-attention alone is permutation-invariant. Beyond the standard sinusoidal and learned absolute encodings, as well as rotary and ALiBi methods, researchers have proposed many other strategies. Below is a comprehensive list of positional encoding techniques, with notes on when each might be useful and key trade-offs to consider.

Absolute Positional Encodings (Fixed or Learned)

Sinusoidal Absolute Encoding
Introduced in the original Transformer, this uses fixed sinusoid functions (varying frequencies) to encode positions. It has no learned parameters and in principle allows extrapolation to sequence lengths not seen during training (since you can compute sin/cos for any position). This method is simple and works well for in-distribution lengths, but absolute encodings generally struggle to generalize to much longer sequences. In practice, performance can degrade if the model is used beyond the maximum length seen in training.

Learned Absolute Embeddings
Used in models like BERT, these assign a trainable vector to each position index. The model can directly learn task-specific positional patterns. However, the encoding is fixed to the maximum sequence length of training and cannot generalize to longer sequences (positions beyond the learned range have no embedding unless you extend the table arbitrarily). Performance drops if input order differs significantly from training scenarios.

Time-Aware Absolute Encoding (tAPE)
A variation designed for time series, tAPE modifies the sinusoidal formula by incorporating the total sequence length into the frequency terms. This adjustment preserves "distance awareness" even in low-dimensional embeddings, yielding smoother differences between positions.

Positional Interpolation (for Pretrained Models)
This is a post-hoc technique rather than a new encoding type. It allows using a pretrained model at longer sequences than it was trained on by interpolating or scaling down position indices. For example, one can interpolate rotary embeddings so that a model trained with max length N can be used for length >N by compressing the positional indices into the original range. The key idea is to reduce or normalize new position indices to align with the range seen in training, mitigating the issue of unseen large indices.

Relative Positional Encodings (Content-Independent)

Rather than encoding absolute positions, relative positional encodings represent the distance or offset between sequence elements. These often yield better length generalization, since the model focuses on relative order. Common approaches include adding learned biases or embeddings based on the pairwise distance i-j between tokens:

Shaw et al.'s Additive Relative Encoding
This method extends self-attention to include representations of the relative distance between tokens. Each possible relative offset (within a range) has an embedding; during attention, a bias is added to the query-key score based on the relative position. Shaw et al. showed this improved machine translation performance over absolute encoding, and that combining relative and absolute encodings gave no extra benefit.
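As a concrete reference for the Shaw et al. scheme just described, here's a rough sketch of the key-side term, assuming single-head tensors; the value-side term and the paper's efficient batching are omitted, and none of the names come from an existing implementation.

```python
# Rough sketch of Shaw et al. (2018) relative attention, key-side term only.
import torch
import torch.nn as nn

def shaw_relative_scores(q, k, rel_k_emb, max_rel):
    """q, k: (batch, seq_len, head_dim); rel_k_emb: embedding over 2*max_rel+1 offsets."""
    b, n, d = q.shape
    pos = torch.arange(n, device=q.device)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_rel, max_rel) + max_rel
    a_k = rel_k_emb(rel)                               # (n, n, head_dim), one vector per offset
    content = q @ k.transpose(-2, -1)                  # (b, n, n) content-content term
    # q_i · a_{ij}: contract the head dimension for every (i, j) pair
    positional = torch.einsum("bid,ijd->bij", q, a_k)
    return (content + positional) / d ** 0.5
```

Here `rel_k_emb` would be something like `nn.Embedding(2 * max_rel + 1, head_dim)`, i.e. one learned vector per clipped relative offset.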
T5-Style Relative Position Buckets
The Text-To-Text Transfer Transformer (T5) uses a form of relative encoding with log-scaled distance buckets. Instead of a unique embedding for every possible offset, distances are grouped into buckets (e.g. "within 1, 2, 4, 8, ... tokens" up to a limit). A learned bias for each bucket is added to attention scores. This saves parameters and lets the model treat very large distances as "all roughly the same" beyond a certain point.

ALiBi (Attention with Linear Biases)
ALiBi is a simple relative bias that doesn't use embeddings at all – it adds a fixed, non-learned penalty proportional to the distance i-j for each head. Specifically, each attention head has a predetermined slope, and for any query-key pair a bias = –(distance)×(slope) is added. This creates an attention bias toward nearer tokens (a recency bias). The key benefit is that this bias extends arbitrarily – you can use a model on longer sequences than seen in training, and the same linear penalty applies (no new parameters needed).

Efficient Relative Position Encoding (eRPE)
Proposed for time-series classification, eRPE adds the relative positional bias after the softmax in attention (rather than before). This "post-softmax" bias effectively sharpens the attention distribution by highlighting relative positions once the base attention weights are computed. Practically, eRPE is implemented by maintaining a trainable vector of biases for each possible distance and adding the bias to the output of the attention softmax.

Rotary and Hybrid Position Encodings

Rotary Position Embeddings (RoPE)
This method (Su et al. 2021) encodes positions by rotating the query and key vectors in multi-dimensional space. Each pair of embedding dimensions forms a 2D plane in which the vectors are rotated by an angle proportional to the token's position. This way, dot products between queries and keys naturally incorporate relative position information via phase alignment. RoPE has been used in some large language models (e.g. GPT-J, GPT-NeoX) as it can be applied on the fly to any sequence length. It effectively blends absolute and relative information (by encoding absolute position as a rotation, while the difference between rotations of two tokens encodes their relative offset).

Extrapolatable (xPos) and Other Rotary Variants
To address RoPE's extrapolation limits, variants like xPos have been proposed. xPos (short for "extrapolatable position embedding") adjusts the rotary formulation to preserve length generalization more faithfully. In general, these methods modify the base rotation or rescale it so that distance relationships remain consistent beyond the original context window. Another practical trick is Rotary Interpolation (RI), as mentioned above, which rescales position indices during inference.

"Untied" or Disentangled Position Representations

Traditional absolute encoding adds the position vector to the token embedding, entangling content and position. Transformer with Untied Positional Encoding (TUPE) instead separates these concerns – it does not add positions to token embeddings. Instead, positional information is injected by modifying the attention score with separate terms. For example, TUPE computes attention as a sum of a content-content term and a position-position term. This way, the model learns content correlations and positional correlations independently, preventing them from interfering with each other.
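A minimal sketch of the untied scoring idea described above, assuming single-head attention and a learned absolute position table; only the score computation is shown, and the names are illustrative rather than taken from the TUPE codebase.

```python
# Sketch of "untied" attention scores: positions get their own query/key projections and
# are summed with the content term at the score level instead of being added to tokens.
# The 1/sqrt(2d) scaling follows the TUPE paper; everything else is simplified.
import torch
import torch.nn as nn

class UntiedScores(nn.Module):
    def __init__(self, embed_dim: int, max_len: int):
        super().__init__()
        self.q_c = nn.Linear(embed_dim, embed_dim)   # content query/key projections
        self.k_c = nn.Linear(embed_dim, embed_dim)
        self.q_p = nn.Linear(embed_dim, embed_dim)   # positional query/key projections
        self.k_p = nn.Linear(embed_dim, embed_dim)
        self.pos = nn.Embedding(max_len, embed_dim)  # positions are NOT added to tokens
        self.scale = (2 * embed_dim) ** 0.5

    def forward(self, x):                            # x: (batch, seq_len, embed_dim)
        n = x.size(1)
        p = self.pos(torch.arange(n, device=x.device))             # (n, embed_dim)
        content = self.q_c(x) @ self.k_c(x).transpose(-2, -1)      # content-to-content
        positional = self.q_p(p) @ self.k_p(p).transpose(-2, -1)   # position-to-position
        return (content + positional) / self.scale                 # (batch, n, n) logits
```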
Another model, DeBERTa, uses a similar idea: each token is represented by a content vector and a position vector, and attention weights are the combination of content-to-content attention and content-to-position interactions.

Advanced Strategies for Long Contexts and Extrapolation

A number of recent research proposals aim to allow Transformers to generalize to longer sequences than seen in training by designing better positional functions:

FIRE (Functional Interpolation of Relative Embeddings)
This method (ICLR 2024) learns a continuous function to generate relative position biases, rather than using fixed formulas or lookup tables. Specifically, FIRE uses a small neural network (e.g. a 2-layer MLP) that takes a normalized relative distance as input and outputs a bias for that distance. During training, distances are normalized by the sequence length (progressive interpolation) so that the network can extrapolate to larger sequences (inputs beyond training length map to values slightly above 1.0). The authors also use a concave monotonic function (like a learned log scale) to emphasize small distances and compress large ones. In essence, FIRE can approximate or even represent other relative schemes (they show it can emulate T5's bias, ALiBi, etc.) and then go beyond them.

KERPLE (Kernelized Relative Positional Embedding)
KERPLE (NeurIPS 2022) takes a theoretical approach by using kernels. It treats the relative position function as a kernel function that ideally would generalize across distances. By using conditionally positive definite kernels (which generalize distance metrics) and then converting them to positive definite form, KERPLE produces relative position embeddings that are better suited to extrapolation. Intuitively, this means the positional differences are encoded in a way that remains well-behaved as distances grow. The method allows a family of kernel-based RPEs, providing a principled way to achieve length extrapolation.

CAPE (Context-Adaptive Positional Encoding)
Instead of keeping positional encodings fixed after training, CAPE (2024 preprint) makes them dynamic at inference. The idea is to adjust positional embeddings based on the input sequence's content or context. CAPE combines learned fixed priors (like a base positional embedding) with a contextual adjustment that is computed from the sequence itself. This means two different sequences of the same length might end up with slightly different position encodings, tuned to their content. In experiments, CAPE allowed a model trained on length 128 to generalize and perform well on length 8192 at test time, significantly outpacing static methods.

Conditional and Domain-Specific Encodings

Positional encodings can also be tailored to specific modalities or made conditional on the input structure:

Conditional Positional Encodings (PEG/CPE for Vision)
In vision transformers, absolute position embeddings (learned or sinusoidal) can limit input resolution flexibility – e.g., a ViT trained on 224×224 images can't naturally handle 384×384 images with a learned 1D position for each patch. Conditional Positional Encoding (CPE) addresses this by generating positional encodings dynamically from the data. One implementation (PEG) applies a small convolution to the patch features to produce a position-dependent bias. Because convolution is translation-equivariant, the model achieves translation invariance and can seamlessly handle larger images or longer sequences than seen in training. Essentially, the position encoding becomes a function of the local neighborhood of each token rather than a fixed vector lookup.
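As a concrete illustration of the PEG idea, here's a minimal sketch (assuming a square patch grid and PyTorch; the module name is illustrative, not from the CPE paper's code): a depthwise 3×3 convolution over the reshaped patch tokens produces the position-dependent signal that gets added back.

```python
# Rough PEG-style conditional positional encoding: reshape patch tokens to a 2D grid,
# apply a depthwise 3x3 conv, and add the result back to the tokens.
import torch
import torch.nn as nn

class PositionalEncodingGenerator(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        # depthwise conv: translation-equivariant, so it works at any input resolution
        self.proj = nn.Conv2d(embed_dim, embed_dim, kernel_size=3,
                              padding=1, groups=embed_dim)

    def forward(self, tokens):            # tokens: (batch, num_patches, embed_dim)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)             # assumes a square patch grid for simplicity
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.proj(grid).flatten(2).transpose(1, 2)
```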
Multi-Dimensional (Axial) Encodings
For data with inherent multi-dimensional structure like images or videos, one common strategy is to use separate position encodings per dimension. For example, one can assign a row embedding and a column embedding to an image patch, and then add them to get the full positional encoding. This axial encoding was used in some early image transformers and allows models to easily generalize to different aspect ratios by mixing and matching row/column positions. Similarly, 3D data (like a sequence of images or a spatial-temporal grid) might use separate encodings for time and space. (A minimal sketch of this idea appears at the end of this section.)

Graph Positional Encodings
In transformer-based graph neural networks, "position" refers to a node's position in the graph structure rather than a sequence index. One popular strategy is Laplacian eigenvector PE, which computes the top-k eigenvectors of the graph Laplacian to serve as coordinates for nodes. This provides each node with a unique vector based on graph structure (analogous to how sinusoidal encoding gives a unique vector per index). It's essentially a generalization of sinusoidal encodings to arbitrary graphs. Other graph PEs include using random walk features or distances from certain landmark nodes. In a graph transformer, these encodings are added to the node feature embeddings so that self-attention can be aware of graph positions.

Other Noteworthy Approaches and Considerations

Convolution-Based Encodings (convSPE)
Some methods generate positional information via convolution operations. For example, convolutional Stochastic Positional Encoding uses a random convolution kernel applied to sequence tokens to produce position-aware features. The sliding-window nature of convolution gives an implicit relative position effect (nearby tokens get similar convolutional context). Liutkus et al. (2021) introduced a convSPE variant that achieves linear time complexity and approximates a desired positional kernel in expectation.

Spline-Based Positional Encoding
A very different idea is to eliminate fixed positional embeddings and instead treat the sequence as points along a continuous path (spline) in latent space. Spline-based Transformers (ECCV 2024) generate a smooth trajectory through latent space that passes through the token embeddings in order. The position of a token is implicitly encoded by where it lies on this trajectory. This method was shown to handle length extrapolation well (since one can extend the spline) and even allows user control of the sequence by manipulating spline control points.

No Positional Encoding (Implicit Position Learning)
Interestingly, a few works have investigated transformers without any explicit positional encoding. In a decoder-only (causal) transformer, the autoregressive masking provides a weak sense of order (each token can only attend to earlier ones) – this breaks permutation symmetry enough that the model might infer positions internally. Indeed, some language models without position embeddings still achieve competitive performance, and they can generalize in-distribution surprisingly well. One study found that explicit position embeddings are not essential for decoder-only Transformers to generalize to longer sequences in language modeling. The network can internally learn positional cues (perhaps via parameters in the attention layers) up to a point.
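And here is the promised sketch of the axial (row/column) encoding described under Multi-Dimensional Encodings above; the module name and interface are illustrative only, not taken from any existing implementation.

```python
# Minimal axial position encoding: separate learned embeddings per row and per column,
# summed to give each patch its positional vector, then flattened to sequence order.
import torch
import torch.nn as nn

class AxialPositionEncoding(nn.Module):
    def __init__(self, embed_dim: int, max_rows: int, max_cols: int):
        super().__init__()
        self.row = nn.Embedding(max_rows, embed_dim)
        self.col = nn.Embedding(max_cols, embed_dim)

    def forward(self, num_rows: int, num_cols: int):
        r = torch.arange(num_rows)
        c = torch.arange(num_cols)
        # (rows, 1, dim) + (1, cols, dim) -> (rows, cols, dim)
        grid = self.row(r)[:, None, :] + self.col(c)[None, :, :]
        return grid.reshape(num_rows * num_cols, -1)
```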
References

The strategies above are drawn from a mix of foundational work and recent research. For instance, Vaswani et al. (2017) introduced sinusoidal encodings; Shaw et al. (2018) proposed relative position representations; Devlin et al. (2019) used learned embeddings in BERT. More recent methods like RoPE, ALiBi, TUPE, and DeBERTa have pushed performance, while FIRE, KERPLE, and CAPE focus on length generalization. Vision and graph models brought forth CPE and Laplacian eigenvector encodings. Each method involves trade-offs in complexity, generalization, and domain applicability, so the choice should align with the task requirements – e.g. use simple absolute encodings for fixed-length scenarios, relative or bias-based encodings for better generalization, and advanced or adaptive methods when pushing the limits of sequence length or when working with unusual data modalities. The continuing evolution of positional encoding research aims to make Transformers even more flexible across diverse tasks and input structures.
-
The original implementation of AttentionSmithy included four positional embedding strategies (sinusoidal, learned, rotary, ALiBi). It would be great to extend this to other strategies that have been proposed, so I'm making this discussion in case anyone has specific ideas.