Position Encoding Extension Brainstorm #6
3 comments · 2 replies
-
Relative positional encoding is one idea. I suspect it would require implementing an AttentionMethod class rather than what's been done previously for position encoding in the package.
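To make the idea concrete, here's a minimal sketch of why a relative scheme has to hook into the attention score computation itself rather than being added to token embeddings up front. The class name and interface below are purely illustrative (not AttentionSmithy's actual AttentionMethod API): a learned scalar bias per clipped relative offset is added to the logits before softmax.

```python
# Hypothetical sketch only: a learned scalar bias per (head, clipped relative offset),
# added to the attention logits before softmax. Because the bias depends on query/key
# *pairs*, it has to live inside the attention computation, not the embedding layer.
import torch
import torch.nn as nn

class RelativeBiasAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, max_distance: int = 128):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.out = nn.Linear(embed_dim, embed_dim)
        self.max_distance = max_distance
        # one learned bias per (clipped relative offset, head)
        self.rel_bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (b, h, n, n)

        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_distance, self.max_distance)
        bias = self.rel_bias(rel + self.max_distance)             # (n, n, h)
        scores = scores + bias.permute(2, 0, 1)                   # broadcast over batch

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)
```

Whatever the real interface ends up looking like, the key requirement is access to the per-head query-key score matrix.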
-
I know various techniques have been developed to account for and extend the context window length for rotary attention. The terms position interpolation (see here) and adjusted base frequency (see here) have come up in my reading (taking terms from here). It's worth looking into further to see how they could be implemented in AttentionSmithy.
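For reference, both tricks boil down to changing how the rotary angles are computed. The sketch below is illustrative only (the function name and parameters are made up, not from AttentionSmithy or any paper's released code): position interpolation divides the position indices by a scale factor, while the adjusted-base approach multiplies the rotary base. Published variants typically derive these factors from the ratio of target to trained context length.

```python
# Illustrative sketch: how the two context-extension tricks change RoPE's angle table.
# The resulting angles feed the usual cos/sin rotation of queries and keys.
import torch

def rope_angles(seq_len, head_dim, base=10000.0,
                position_scale=1.0,   # position interpolation: scale > 1 compresses indices
                base_scale=1.0):      # adjusted base frequency: scale > 1 raises the base
    # inverse frequencies, one per pair of embedding dimensions
    inv_freq = 1.0 / ((base * base_scale) ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # position interpolation squeezes unseen large positions back into the trained range
    positions = torch.arange(seq_len).float() / position_scale
    return torch.outer(positions, inv_freq)   # (seq_len, head_dim // 2) angles

# e.g. a model trained on 2k tokens, run at 8k tokens, might use either:
# angles = rope_angles(8192, 64, position_scale=4.0)   # position interpolation
# angles = rope_angles(8192, 64, base_scale=8.0)       # adjusted base frequency
```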
-
I thought I'd experiment with ChatGPT's new deep research feature for this topic. Here's a report it spit out. The references did not carry over, but for idea generation it was a pretty good report.

Additional Positional Encoding Strategies for Transformers

Transformer models require positional encodings to introduce order information, since self-attention alone is permutation-invariant. Beyond the standard sinusoidal and learned absolute encodings, as well as rotary and ALiBi methods, researchers have proposed many other strategies. Below is a comprehensive list of positional encoding techniques, with notes on when each might be useful and key trade-offs to consider.

Absolute Positional Encodings (Fixed or Learned)

Sinusoidal Absolute Encoding
Introduced in the original Transformer, this uses fixed sinusoid functions (varying frequencies) to encode positions. It has no learned parameters and in principle allows extrapolation to sequence lengths not seen during training (since you can compute sin/cos for any position). This method is simple and works well for in-distribution lengths, but absolute encodings generally struggle to generalize to much longer sequences. In practice, performance can degrade if the model is used beyond the maximum length seen in training.

Learned Absolute Embeddings
Used in models like BERT, these assign a trainable vector to each position index. The model can directly learn task-specific positional patterns. However, the encoding is fixed to the maximum sequence length of training and cannot generalize to longer sequences (positions beyond the learned range have no embedding unless you extend the table arbitrarily). Performance drops if input order differs significantly from training scenarios.

Time-Aware Absolute Encoding (tAPE)
A variation designed for time series, tAPE modifies the sinusoidal formula by incorporating the total sequence length into the frequency terms. This adjustment preserves "distance awareness" even in low-dimensional embeddings, yielding smoother differences between positions.

Positional Interpolation (for Pretrained Models)
This is a post-hoc technique rather than a new encoding type. It allows using a pretrained model at longer sequences than it was trained on by interpolating or scaling down position indices. For example, one can interpolate rotary embeddings so that a model trained with max length N can be used for length >N by compressing the positional indices into the original range. The key idea is to reduce or normalize new position indices to align with the range seen in training, mitigating the issue of unseen large indices.

Relative Positional Encodings (Content-Independent)

Rather than encoding absolute positions, relative positional encodings represent the distance or offset between sequence elements. These often yield better length generalization, since the model focuses on relative order. Common approaches include adding learned biases or embeddings based on the pairwise distance i-j between tokens:

Shaw et al.'s Additive Relative Encoding
This method extends self-attention to include representations of the relative distance between tokens. Each possible relative offset (within a range) has an embedding; during attention, a bias is added to the query-key score based on the relative position. Shaw et al. showed this improved machine translation performance over absolute encoding, and that combining relative and absolute encodings gave no extra benefit.
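As a concrete reference for the Shaw et al. scheme just described, here's a rough sketch of the key-side term, assuming single-head tensors; the value-side term and the paper's efficient batching are omitted, and none of the names come from an existing implementation.

```python
# Rough sketch of Shaw et al. (2018) relative attention, key-side term only.
import torch
import torch.nn as nn

def shaw_relative_scores(q, k, rel_k_emb, max_rel):
    """q, k: (batch, seq_len, head_dim); rel_k_emb: embedding over 2*max_rel+1 offsets."""
    b, n, d = q.shape
    pos = torch.arange(n, device=q.device)
    rel = (pos[None, :] - pos[:, None]).clamp(-max_rel, max_rel) + max_rel
    a_k = rel_k_emb(rel)                               # (n, n, head_dim), one vector per offset
    content = q @ k.transpose(-2, -1)                  # (b, n, n) content-content term
    # q_i · a_{ij}: contract the head dimension for every (i, j) pair
    positional = torch.einsum("bid,ijd->bij", q, a_k)
    return (content + positional) / d ** 0.5
```

Here `rel_k_emb` would be something like `nn.Embedding(2 * max_rel + 1, head_dim)`, i.e. one learned vector per clipped relative offset.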
T5-Style Relative Position Buckets
The Text-To-Text Transfer Transformer (T5) uses a form of relative encoding with log-scaled distance buckets. Instead of a unique embedding for every possible offset, distances are grouped into buckets (e.g. "within 1, 2, 4, 8, ... tokens" up to a limit). A learned bias for each bucket is added to attention scores. This saves parameters and lets the model treat very large distances as "all roughly the same" beyond a certain point.

ALiBi (Attention with Linear Biases)
ALiBi is a simple relative bias that doesn't use embeddings at all – it adds a fixed, non-learned penalty proportional to the distance i-j for each head. Specifically, each attention head has a predetermined slope, and for any query-key pair a bias = –(distance)×(slope) is added. This creates an attention bias toward nearer tokens (a recency bias). The key benefit is that this bias extends arbitrarily – you can use a model on longer sequences than seen in training, and the same linear penalty applies (no new parameters needed).

Efficient Relative Position Encoding (eRPE)
Proposed for time-series classification, eRPE adds the relative positional bias after the softmax in attention (rather than before). This "post-softmax" bias effectively sharpens the attention distribution by highlighting relative positions once the base attention weights are computed. Practically, eRPE is implemented by maintaining a trainable vector of biases for each possible distance and adding the bias to the output of the attention softmax.

Rotary and Hybrid Position Encodings

Rotary Position Embeddings (RoPE)
This method (Su et al. 2021) encodes positions by rotating the query and key vectors in multi-dimensional space. Each pair of embedding dimensions forms a 2D plane in which the vectors are rotated by an angle proportional to the token's position. This way, dot products between queries and keys naturally incorporate relative position information via phase alignment. RoPE has been used in some large language models (e.g. GPT-J, GPT-NeoX) as it can be applied on the fly to any sequence length. It effectively blends absolute and relative information (by encoding absolute position as a rotation, while the difference between rotations of two tokens encodes their relative offset).

Extrapolatable (xPos) and Other Rotary Variants
To address RoPE's extrapolation limits, variants like xPos have been proposed. xPos (short for "extrapolatable position embedding") adjusts the rotary formulation to preserve length generalization more faithfully. In general, these methods modify the base rotation or rescale it so that distance relationships remain consistent beyond the original context window. Another practical trick is Rotary Interpolation (RI), as mentioned above, which rescales position indices during inference.

"Untied" or Disentangled Position Representations

Traditional absolute encoding adds the position vector to the token embedding, entangling content and position. Transformer with Untied Positional Encoding (TUPE) instead separates these concerns – it does not add positions to token embeddings. Instead, positional information is injected by modifying the attention score with separate terms. For example, TUPE computes attention as a sum of a content-content term and a position-position term. This way, the model learns content correlations and positional correlations independently, preventing them from interfering with each other.
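A minimal sketch of the untied scoring idea described above, assuming single-head attention and a learned absolute position table; only the score computation is shown, and the names are illustrative rather than taken from the TUPE codebase.

```python
# Sketch of "untied" attention scores: positions get their own query/key projections and
# are summed with the content term at the score level instead of being added to tokens.
# The 1/sqrt(2d) scaling follows the TUPE paper; everything else is simplified.
import torch
import torch.nn as nn

class UntiedScores(nn.Module):
    def __init__(self, embed_dim: int, max_len: int):
        super().__init__()
        self.q_c = nn.Linear(embed_dim, embed_dim)   # content query/key projections
        self.k_c = nn.Linear(embed_dim, embed_dim)
        self.q_p = nn.Linear(embed_dim, embed_dim)   # positional query/key projections
        self.k_p = nn.Linear(embed_dim, embed_dim)
        self.pos = nn.Embedding(max_len, embed_dim)  # positions are NOT added to tokens
        self.scale = (2 * embed_dim) ** 0.5

    def forward(self, x):                            # x: (batch, seq_len, embed_dim)
        n = x.size(1)
        p = self.pos(torch.arange(n, device=x.device))             # (n, embed_dim)
        content = self.q_c(x) @ self.k_c(x).transpose(-2, -1)      # content-to-content
        positional = self.q_p(p) @ self.k_p(p).transpose(-2, -1)   # position-to-position
        return (content + positional) / self.scale                 # (batch, n, n) logits
```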
Another model, DeBERTa, uses a similar idea: each token is represented by a content vector and a position vector, and attention weights are the combination of content-to-content attention and content-to-position interactions.

Advanced Strategies for Long Contexts and Extrapolation

A number of recent research proposals aim to allow Transformers to generalize to longer sequences than seen in training by designing better positional functions:

FIRE (Functional Interpolation of Relative Embeddings)
This method (ICLR 2024) learns a continuous function to generate relative position biases, rather than using fixed formulas or lookup tables. Specifically, FIRE uses a small neural network (e.g. a 2-layer MLP) that takes a normalized relative distance as input and outputs a bias for that distance. During training, distances are normalized by the sequence length (progressive interpolation) so that the network can extrapolate to larger sequences (inputs beyond training length map to values slightly above 1.0). The authors also use a concave monotonic function (like a learned log scale) to emphasize small distances and compress large ones. In essence, FIRE can approximate or even represent other relative schemes (they show it can emulate T5's bias, ALiBi, etc.) and then go beyond them.

KERPLE (Kernelized Relative Positional Embedding)
KERPLE (NeurIPS 2022) takes a theoretical approach by using kernels. It treats the relative position function as a kernel function that ideally would generalize across distances. By using conditionally positive definite kernels (which generalize distance metrics) and then converting them to positive definite form, KERPLE produces relative position embeddings that are better suited to extrapolation. Intuitively, this means the positional differences are encoded in a way that remains well-behaved as distances grow. The method allows a family of kernel-based RPEs, providing a principled way to achieve length extrapolation.

CAPE (Context-Adaptive Positional Encoding)
Instead of keeping positional encodings fixed after training, CAPE (2024 preprint) makes them dynamic at inference. The idea is to adjust positional embeddings based on the input sequence's content or context. CAPE combines learned fixed priors (like a base positional embedding) with a contextual adjustment that is computed from the sequence itself. This means two different sequences of the same length might end up with slightly different position encodings, tuned to their content. In experiments, CAPE allowed a model trained on length 128 to generalize and perform well on length 8192 at test time, significantly outpacing static methods.

Conditional and Domain-Specific Encodings

Positional encodings can also be tailored to specific modalities or made conditional on the input structure:

Conditional Positional Encodings (PEG/CPE for Vision)
In vision transformers, absolute position embeddings (learned or sinusoidal) can limit input resolution flexibility – e.g., a ViT trained on 224×224 images can't naturally handle 384×384 images with a learned 1D position for each patch. Conditional Positional Encoding (CPE) addresses this by generating positional encodings dynamically from the data. One implementation (PEG) applies a small convolution to the patch features to produce a position-dependent bias. Because convolution is translation-equivariant, the model achieves translation invariance and can seamlessly handle larger images or longer sequences than seen in training. Essentially, the position encoding becomes a function of the local neighborhood of each token rather than a fixed vector lookup.
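As a concrete illustration of the PEG idea, here's a minimal sketch (assuming a square patch grid and PyTorch; the module name is illustrative, not from the CPE paper's code): a depthwise 3×3 convolution over the reshaped patch tokens produces the position-dependent signal that gets added back.

```python
# Rough PEG-style conditional positional encoding: reshape patch tokens to a 2D grid,
# apply a depthwise 3x3 conv, and add the result back to the tokens.
import torch
import torch.nn as nn

class PositionalEncodingGenerator(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        # depthwise conv: translation-equivariant, so it works at any input resolution
        self.proj = nn.Conv2d(embed_dim, embed_dim, kernel_size=3,
                              padding=1, groups=embed_dim)

    def forward(self, tokens):            # tokens: (batch, num_patches, embed_dim)
        b, n, c = tokens.shape
        h = w = int(n ** 0.5)             # assumes a square patch grid for simplicity
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.proj(grid).flatten(2).transpose(1, 2)
```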
Multi-Dimensional (Axial) Encodings
For data with inherent multi-dimensional structure like images or videos, one common strategy is to use separate position encodings per dimension. For example, one can assign a row embedding and a column embedding to an image patch, and then add them to get the full positional encoding. This axial encoding was used in some early image transformers and allows models to easily generalize to different aspect ratios by mixing and matching row/column positions. Similarly, 3D data (like a sequence of images or a spatial-temporal grid) might use separate encodings for time and space. (A minimal sketch of this idea appears at the end of this section.)

Graph Positional Encodings
In transformer-based graph neural networks, "position" refers to a node's position in the graph structure rather than a sequence index. One popular strategy is Laplacian eigenvector PE, which computes the top-k eigenvectors of the graph Laplacian to serve as coordinates for nodes. This provides each node with a unique vector based on graph structure (analogous to how sinusoidal encoding gives a unique vector per index). It's essentially a generalization of sinusoidal encodings to arbitrary graphs. Other graph PEs include using random walk features or distances from certain landmark nodes. In a graph transformer, these encodings are added to the node feature embeddings so that self-attention can be aware of graph positions.

Other Noteworthy Approaches and Considerations

Convolution-Based Encodings (convSPE)
Some methods generate positional information via convolution operations. For example, convolutional Stochastic Positional Encoding uses a random convolution kernel applied to sequence tokens to produce position-aware features. The sliding-window nature of convolution gives an implicit relative position effect (nearby tokens get similar convolutional context). Liutkus et al. (2021) introduced a convSPE variant that achieves linear time complexity and approximates a desired positional kernel in expectation.

Spline-Based Positional Encoding
A very different idea is to eliminate fixed positional embeddings and instead treat the sequence as points along a continuous path (spline) in latent space. Spline-based Transformers (ECCV 2024) generate a smooth trajectory through latent space that passes through the token embeddings in order. The position of a token is implicitly encoded by where it lies on this trajectory. This method was shown to handle length extrapolation well (since one can extend the spline) and even allows user control of the sequence by manipulating spline control points.

No Positional Encoding (Implicit Position Learning)
Interestingly, a few works have investigated transformers without any explicit positional encoding. In a decoder-only (causal) transformer, the autoregressive masking provides a weak sense of order (each token can only attend to earlier ones) – this breaks permutation symmetry enough that the model might infer positions internally. Indeed, some language models without position embeddings still achieve competitive performance, and they can generalize in-distribution surprisingly well. One study found that explicit position embeddings are not essential for decoder-only Transformers to generalize to longer sequences in language modeling. The network can internally learn positional cues (perhaps via parameters in the attention layers) up to a point.
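And here is the promised sketch of the axial (row/column) encoding described under Multi-Dimensional Encodings above; the module name and interface are illustrative only, not taken from any existing implementation.

```python
# Minimal axial position encoding: separate learned embeddings per row and per column,
# summed to give each patch its positional vector, then flattened to sequence order.
import torch
import torch.nn as nn

class AxialPositionEncoding(nn.Module):
    def __init__(self, embed_dim: int, max_rows: int, max_cols: int):
        super().__init__()
        self.row = nn.Embedding(max_rows, embed_dim)
        self.col = nn.Embedding(max_cols, embed_dim)

    def forward(self, num_rows: int, num_cols: int):
        r = torch.arange(num_rows)
        c = torch.arange(num_cols)
        # (rows, 1, dim) + (1, cols, dim) -> (rows, cols, dim)
        grid = self.row(r)[:, None, :] + self.col(c)[None, :, :]
        return grid.reshape(num_rows * num_cols, -1)
```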
References

The strategies above are drawn from a mix of foundational work and recent research. For instance, Vaswani et al. (2017) introduced sinusoidal encodings; Shaw et al. (2018) proposed relative position representations; Devlin et al. (2019) used learned embeddings in BERT. More recent methods like RoPE, ALiBi, TUPE, and DeBERTa have pushed performance, while FIRE, KERPLE, and CAPE focus on length generalization. Vision and graph models brought forth CPE and Laplacian eigenvector encodings. Each method involves trade-offs in complexity, generalization, and domain applicability, so the choice should align with the task requirements – e.g. use simple absolute encodings for fixed-length scenarios, relative or bias-based encodings for better generalization, and advanced or adaptive methods when pushing the limits of sequence length or when working with unusual data modalities. The continuing evolution of positional encoding research aims to make Transformers even more flexible across diverse tasks and input structures.
-
The original implementation of AttentionSmithy included four positional embedding strategies (sinusoidal, learned, rotary, ALiBi). It would be great to extend this to other strategies that have been proposed, so I'm making this discussion in case anyone has specific ideas.