Hi,
Thanks for the well-written code!
I was wondering if you've explored the impact of the number of layers in the position-wise MLP in the transformer block. If I'm not mistaken, most of the implementations I've seen (like https://github.com/kimiyoung/transformer-xl/tree/master, which is cited in Stabilizing Transformers for RL: https://arxiv.org/abs/1910.06764), and even the original Transformer paper, use an MLP with 2 layers and a ReLU between them.
So I was wondering if your choice of a single layer followed by a ReLU (in TransformerBlock: `self.fc = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())`) is due to empirical tests you've done or work I'm not aware of?
I'm not aware of any work that studies the impact of the position-wise MLP architecture in the transformer block, which I guess might be hard to do properly since, for example, adding a layer changes the total number of parameters.
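To make the comparison concrete, here's a minimal sketch of the two variants (assuming PyTorch; the `embed_dim` value is hypothetical, and the 4x inner expansion is just the convention from the original Transformer paper, not something taken from this repo):

```python
import torch
import torch.nn as nn

embed_dim = 256           # hypothetical embedding size
ffn_dim = 4 * embed_dim   # conventional inner expansion from the original Transformer paper

# Variant used in this repo's TransformerBlock: a single linear layer followed by a ReLU
single_layer_mlp = nn.Sequential(
    nn.Linear(embed_dim, embed_dim),
    nn.ReLU(),
)

# Standard position-wise feed-forward network (original Transformer / Transformer-XL):
# two linear layers with a ReLU in between, usually with a wider inner dimension
two_layer_mlp = nn.Sequential(
    nn.Linear(embed_dim, ffn_dim),
    nn.ReLU(),
    nn.Linear(ffn_dim, embed_dim),
)

x = torch.randn(8, 10, embed_dim)  # (batch, seq_len, embed_dim)
print(single_layer_mlp(x).shape, two_layer_mlp(x).shape)  # both map back to (8, 10, embed_dim)

# The parameter counts differ substantially, which is what makes a fair comparison tricky
print(sum(p.numel() for p in single_layer_mlp.parameters()),
      sum(p.numel() for p in two_layer_mlp.parameters()))
```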