Hi,
Got here from your article. I have a couple of quick questions about the choice of architecture for your model. Specifically, if we are using the transformer architecture, why are we aggregating the results with a pooling layer? The attention mechanism works best for sequential input and sequential output, since it learns how each token relates to the other tokens in the sequence. Aggregating all the tokens feels like it defeats the purpose of using a transformer model. Wouldn't using a causal mask with a decoder-style architecture be more appropriate in this case?
Sorry if I'm misunderstanding your approach; any clarification would be greatly appreciated!
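For concreteness, here's a minimal sketch (in PyTorch; all class and variable names are my own for illustration, not taken from your code) of the pooling approach I'm asking about: the per-token encoder outputs are averaged into a single vector before the classification head, which is where the per-token information seems to get collapsed.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the pattern in question, not the repo's actual model:
# a transformer encoder whose token outputs are mean-pooled into one vector.
class PooledTransformerClassifier(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))  # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                # aggregate all tokens into one vector
        return self.head(pooled)              # (batch, n_classes)

x = torch.randint(0, 1000, (3, 10))
logits = PooledTransformerClassifier()(x)
print(logits.shape)  # torch.Size([3, 2])
```

The alternative I had in mind would keep the sequence dimension and apply a causal attention mask, as in a decoder-only setup, rather than pooling.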