Hi,
Got here from your article. I have a couple of quick questions about the choice of architecture for your model. Specifically, if we are using the transformer architecture, why are we aggregating the results with a pooling layer? The attention mechanism works best for sequential input and sequential output, since it learns how each token relates to the other tokens in the sequence. Aggregating all the tokens feels like it defeats the purpose of using a transformer model. Wouldn't using a causal mask with a decoder-style architecture be more appropriate in this case?
Sorry if I'm misunderstanding your approach; any clarification would be greatly appreciated!
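For concreteness, here's a minimal sketch (in PyTorch; all class and variable names are my own for illustration, not taken from your code) of the pooling approach I'm asking about: the per-token encoder outputs are averaged into a single vector before the classification head, which is where the per-token information seems to get collapsed.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the pattern in question, not the repo's actual model:
# a transformer encoder whose token outputs are mean-pooled into one vector.
class PooledTransformerClassifier(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))  # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                # aggregate all tokens into one vector
        return self.head(pooled)              # (batch, n_classes)

x = torch.randint(0, 1000, (3, 10))
logits = PooledTransformerClassifier()(x)
print(logits.shape)  # torch.Size([3, 2])
```

The alternative I had in mind would keep the sequence dimension and apply a causal attention mask, as in a decoder-only setup, rather than pooling.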