Hi,
First of all, thank you very much for your implementation; it saved me a lot of time.
I've noticed a minor difference compared to the CoAtNet paper (file coatnet.py, lines 205-214).
In the ConvTransformer class, you've added a residual connection only around the MultiHeadSelfAttention block. However, in the CoAtNet paper, and more broadly in the literature, a residual connection is typically applied around both the multi-head attention block and the dense projection (feed-forward) block. This improves gradient flow through the deep architecture and enhances training stability. Could you please consider updating your code so that future GitHub users can benefit from your excellent work?
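
For reference, here is a minimal sketch of the residual pattern I mean, i.e. x = x + Attention(Norm(x)) followed by x = x + FFN(Norm(x)). The module and parameter names below are only illustrative and are not taken from coatnet.py:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Dense projection (MLP) block with pre-normalization."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)


class TransformerBlockWithResiduals(nn.Module):
    """Illustrative block: residuals around both attention and the FFN."""
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = FeedForward(dim, dim * mlp_ratio)

    def forward(self, x):
        # Residual connection around the multi-head attention block
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Residual connection around the dense projection (FFN) block as well
        x = x + self.ffn(x)
        return x
```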
Best regards,
Alae