diff --git a/chapters/en/chapter1/6.mdx b/chapters/en/chapter1/6.mdx
index e540c8624..125d09787 100644
--- a/chapters/en/chapter1/6.mdx
+++ b/chapters/en/chapter1/6.mdx
@@ -209,10 +209,10 @@ length.
 ### Axial positional encodings
 
 [Reformer](https://huggingface.co/docs/transformers/model_doc/reformer) uses axial positional encodings: in traditional transformer models, the positional encoding
-E is a matrix of size \\(l\\) by \\(d\\), \\(l\\) being the sequence length and \\(d\\) the dimension of the
+E is a matrix of size\ \\(l\\) by\ \\(d\\),\ \\(l\\) being the sequence length and\ \\(d\\) the dimension of the
 hidden state. If you have very long texts, this matrix can be huge and take way too much space on the GPU. To alleviate
 that, axial positional encodings consist of factorizing that big matrix E in two smaller matrices E1 and E2, with
-dimensions \\(l_{1} \times d_{1}\\) and \\(l_{2} \times d_{2}\\), such that \\(l_{1} \times l_{2} = l\\) and
+dimensions\ \\(l_{1} \times d_{1}\\) and \\(l_{2} \times d_{2}\\), such that \\(l_{1} \times l_{2} = l\\) and
 \\(d_{1} + d_{2} = d\\) (with the product for the lengths, this ends up being way smaller). The embedding for time
 step \\(j\\) in E is obtained by concatenating the embeddings for timestep \\(j \% l1\\) in E1 and \\(j // l1\\)
 in E2.
@@ -221,4 +221,4 @@ in E2.
 
 In this section, we've explored the three main Transformer architectures and some specialized attention mechanisms. Understanding these architectural differences is crucial for selecting the right model for your specific NLP task.
 
-As we move forward in the course, you'll get hands-on experience with these different architectures and learn how to fine-tune them for your specific needs. In the next section, we'll look at some of the limitations and biases present in these models that you should be aware of when deploying them.
\ No newline at end of file
+As we move forward in the course, you'll get hands-on experience with these different architectures and learn how to fine-tune them for your specific needs. In the next section, we'll look at some of the limitations and biases present in these models that you should be aware of when deploying them.
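
For context on the paragraph touched by this patch, here is a minimal PyTorch sketch of the axial factorization it describes. It is not part of the patch, and the table names and sizes (`E1`, `E2`, `l1`, `l2`, `d1`, `d2`) are illustrative assumptions, not Reformer's actual implementation.

```python
import torch

# Illustrative sketch of axial positional encodings (assumed sizes, not Reformer's).
l1, l2 = 128, 512  # factorization of the sequence length: l = l1 * l2 = 65,536
d1, d2 = 256, 768  # split of the hidden dimension: d = d1 + d2 = 1,024

# Two small tables replace one big l x d positional encoding matrix E.
E1 = torch.randn(l1, d1)
E2 = torch.randn(l2, d2)

def axial_position_embedding(j: int) -> torch.Tensor:
    """Embedding for timestep j: concatenate E1[j % l1] and E2[j // l1]."""
    return torch.cat([E1[j % l1], E2[j // l1]])

print(axial_position_embedding(1000).shape)  # torch.Size([1024]), i.e. d1 + d2
```

With these example sizes, the two tables hold l1*d1 + l2*d2 = 425,984 values instead of the l*d = 67,108,864 a single full-size matrix would need, which is the saving the paragraph refers to.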