Architecture question #7

@jyan1999

Description

Hi,

Got here from your article. I have a couple of quick questions about the choice of architecture for your model. Specifically, if we are using the transformer architecture, why are we aggregating the results with a pooling layer? The attention mechanism works best for sequential input and sequential output, since it learns how each token relates to the other tokens in the sequence. Aggregating all the tokens feels like it defeats the purpose of using a transformer model. I feel like using a causal mask with a decoder-layer architecture would be more appropriate in this case.
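To make the distinction concrete, here is a minimal sketch of the two options being contrasted. This is purely illustrative and not taken from the article's code; the shapes and variable names are assumptions for the example.

```python
import numpy as np

# Assume a transformer encoder has already produced one d-dimensional
# output vector per token (values here are random placeholders).
seq_len, d_model = 5, 8
rng = np.random.default_rng(0)
token_outputs = rng.random((seq_len, d_model))

# Option 1: pooling. Mean-pool across the sequence axis, collapsing
# all per-token outputs into a single fixed-size vector.
pooled = token_outputs.mean(axis=0)  # shape: (d_model,)

# Option 2: a causal (decoder-style) mask. Each position may attend
# only to itself and earlier positions, so per-token outputs are
# preserved instead of being aggregated.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(pooled.shape)    # (8,)
print(causal_mask[0])  # first token attends only to itself
```

With pooling, token-level structure is discarded after the encoder; with a causal mask, the model keeps one output per position, which is the property the question is pointing at.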

Sorry if I am misunderstanding your approach; any clarification would be greatly appreciated!
