where is transformer decoder #2866
Replies: 2 comments 4 replies
-
This might be a bit misleading: the example uses the `TransformerEncoder` module, but the model is still decoder-only. Decoder-only models used for text generation have no separate encoder, since they only rely on previous tokens. The text is generated autoregressively by predicting one token at a time, using masked self-attention to restrict each position to past tokens (causal). Although this is called decoder-only, the transformer block used does not have cross-attention like the original transformer decoder block, because there is no encoder output to attend to. So in this case, the `TransformerEncoder` block with an autoregressive (causal) mask is exactly the decoder-only block the example needs. There also exist encoder-only models like BERT, typically used for full-context tasks, which do not use autoregressive masking. Hopefully this helps 😅
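To make that concrete, here is a minimal sketch of a decoder-only forward pass built from burn's `TransformerEncoder` plus an autoregressive mask, loosely following the text-generation example (the function name and dimension comments are made up for illustration, and exact API details may vary across burn versions):

```rust
use burn::nn::attention::generate_autoregressive_mask;
use burn::nn::transformer::{TransformerEncoder, TransformerEncoderInput};
use burn::tensor::{backend::Backend, Tensor};

/// "Decoder-only" forward pass: an encoder block plus a causal mask.
/// There is no cross-attention because there is no encoder output to attend to.
fn decoder_only_forward<B: Backend>(
    transformer: &TransformerEncoder<B>,
    embedding: Tensor<B, 3>, // [batch_size, seq_length, d_model]
) -> Tensor<B, 3> {
    let [batch_size, seq_length, _d_model] = embedding.dims();
    let device = embedding.device();

    // Causal mask: each position may only attend to itself and earlier positions,
    // which is what makes autoregressive next-token prediction work.
    let mask_attn = generate_autoregressive_mask::<B>(batch_size, seq_length, &device);

    transformer.forward(TransformerEncoderInput::new(embedding).mask_attn(mask_attn))
}
```

Training this with a next-token prediction objective gives GPT-style generation, even though the module is named `TransformerEncoder`.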
-
I think I finally understand what you said. So if I want to use bidirectional self-attention in the `TransformerEncoder`, I just leave out the autoregressive mask?
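Something like the sketch below is what I have in mind, assuming the same hypothetical `transformer` and `embedding` as in the earlier sketch: simply skip the `mask_attn` call.

```rust
use burn::nn::transformer::{TransformerEncoder, TransformerEncoderInput};
use burn::tensor::{backend::Backend, Tensor};

/// Bidirectional (BERT-style) forward pass: no autoregressive mask, so every
/// position can attend to every other position in the sequence.
fn bidirectional_forward<B: Backend>(
    transformer: &TransformerEncoder<B>,
    embedding: Tensor<B, 3>, // [batch_size, seq_length, d_model]
) -> Tensor<B, 3> {
    transformer.forward(TransformerEncoderInput::new(embedding))
}
```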
-
Hi,
Only the transformer encoder is used in burn's text-generation example (https://github.com/tracel-ai/burn/tree/main/examples/text-generation). Don't we need a decoder here? If so, when and where should the decoder be used?