
Commit 85c17ee

Merge #1763
1763: Doc update (recurrence.md): fixed incorrect output dimensions, clarified batching. r=CarloLucibello a=NightMachinary

Co-authored-by: NightMachinary <rudiwillalwaysloveyou@gmail.com>
Co-authored-by: NightMachinary <36224762+NightMachinary@users.noreply.github.com>
2 parents 4e28377 + 267b115 commit 85c17ee


docs/src/models/recurrence.md

Lines changed: 45 additions & 14 deletions
@@ -8,22 +8,24 @@ To introduce Flux's recurrence functionalities, we will consider the following v
 In the above, we have a sequence of length 3, where `x1` to `x3` represent the input at each step (could be a timestamp or a word in a sentence), and `y1` to `y3` are their respective outputs.
 
-An aspect to recognize is that in such model, the recurrent cells `A` all refer to the same structure. What distinguishes it from a simple dense layer is that the cell `A` is fed, in addition to an input `x`, with information from the previous state of the model (hidden state denoted as `h1` & `h2` in the diagram).
+An aspect to recognize is that in such a model, the recurrent cells `A` all refer to the same structure. What distinguishes it from a simple dense layer is that the cell `A` is fed, in addition to an input `x`, with information from the previous state of the model (hidden state denoted as `h1` & `h2` in the diagram).
 
 In the most basic RNN case, cell A could be defined by the following:
 
 ```julia
-Wxh = randn(Float32, 5, 2)
-Whh = randn(Float32, 5, 5)
-b = randn(Float32, 5)
+output_size = 5
+input_size = 2
+Wxh = randn(Float32, output_size, input_size)
+Whh = randn(Float32, output_size, output_size)
+b = randn(Float32, output_size)
 
 function rnn_cell(h, x)
     h = tanh.(Wxh * x .+ Whh * h .+ b)
     return h, h
 end
 
-x = rand(Float32, 2) # dummy data
-h = rand(Float32, 5) # initial hidden state
+x = rand(Float32, input_size) # dummy input data
+h = rand(Float32, output_size) # random initial hidden state
 
 h, y = rnn_cell(h, x)
 ```
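
To make the recurrence concrete, here is a small sketch (reusing `rnn_cell`, `input_size`, and `output_size` from the hunk above) of unrolling the cell over the 3-step sequence in the diagram; the same weights serve every step, and only the hidden state is threaded through:

```julia
# Unroll the cell over a 3-step sequence (x1, x2, x3 in the diagram).
x1, x2, x3 = [rand(Float32, input_size) for _ in 1:3]
h0 = zeros(Float32, output_size)  # initial hidden state

h1, y1 = rnn_cell(h0, x1)  # x1 -> y1
h2, y2 = rnn_cell(h1, x2)  # x2 -> y2, fed h1 from the previous step
h3, y3 = rnn_cell(h2, x3)  # x3 -> y3, same Wxh, Whh, b throughout
```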
@@ -84,9 +86,8 @@ Using the previously defined `m` recurrent model, we can now apply it to a singl
 julia> x = rand(Float32, 2);
 
 julia> m(x)
-2-element Vector{Float32}:
- -0.12852919
-  0.009802654
+1-element Vector{Float32}:
+ 0.31759313
 ```
 
 The `m(x)` operation would be represented by `x1 -> A -> y1` in our diagram.
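
For readers following along, `m` is defined earlier on the page and is not part of this hunk; a hypothetical stand-in consistent with the corrected shapes (2 input features, a 1-element output) could be:

```julia
using Flux

# Hypothetical stand-in for the page's `m`; the exact layer sizes are
# an assumption, chosen to match the shapes in the REPL output above.
m = Chain(RNN(2, 5), Dense(5, 1))

x = rand(Float32, 2)
m(x)  # 1-element Vector{Float32}, as in the corrected output
```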
@@ -103,9 +104,9 @@ julia> x = [rand(Float32, 2) for i = 1:3];
 
 julia> [m(xi) for xi in x]
 3-element Vector{Vector{Float32}}:
-[-0.018976994, 0.61098206]
-[-0.8924057, -0.7512169]
-[-0.34613007, -0.54565114]
+[-0.033448644]
+[0.5259023]
+[-0.11183384]
 ```
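
Note that the recurrent `m` is stateful: each call advances the hidden state, which is what makes looping over the sequence meaningful. Before feeding a fresh, unrelated sequence, the state can be cleared with `Flux.reset!` (used later on this page); a brief sketch, assuming the stand-in `m` above:

```julia
# Each call to `m` advances its hidden state, so a new, unrelated
# sequence should start from a freshly reset state.
Flux.reset!(m)
y = [m(xi) for xi in x]  # y1, y2, y3 for the new sequence
```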
 
 !!! warning "Use of map and broadcast"
@@ -175,9 +176,39 @@ x = [rand(Float32, 2, 4) for i = 1:3]
 y = [rand(Float32, 1, 4) for i = 1:3]
 ```
 
-That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])`, would still represent `x1 -> y1` in our diagram and returns the first word output, but now for each of the 4 independent sentences (second dimension of the input matrix).
+That would mean that we have 4 sentences (or samples), each with 2 features (let's say a very small embedding!) and each with a length of 3 (3 words per sentence). Computing `m(batch[1])` would still represent `x1 -> y1` in our diagram and return the first word's output, but now for each of the 4 independent sentences (second dimension of the input matrix). We do not need to use `Flux.reset!(m)` here; each sentence in the batch will output in its own "column", and the outputs of the different sentences won't mix.
+
+To illustrate, we go through an example of batching with our implementation of `rnn_cell`. The implementation doesn't need to change; the batching comes for "free" from the way Julia does broadcasting and from the rules of matrix multiplication.
+
+```julia
+output_size = 5
+input_size = 2
+Wxh = randn(Float32, output_size, input_size)
+Whh = randn(Float32, output_size, output_size)
+b = randn(Float32, output_size)
+
+function rnn_cell(h, x)
+    h = tanh.(Wxh * x .+ Whh * h .+ b)
+    return h, h
+end
+```
+
+Here, we use the last dimension of the input and the hidden state as the batch dimension, i.e. `h[:, n]` would be the hidden state of the nth sentence in the batch.
+
+```julia
+batch_size = 4
+x = rand(Float32, input_size, batch_size)  # dummy input data
+h = rand(Float32, output_size, batch_size) # random initial hidden state
+
+h, y = rnn_cell(h, x)
+```
+
+```julia
+julia> size(h) == size(y) == (output_size, batch_size)
+true
+```
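
As a quick check of the independence claim (reusing the batched definitions just above), running one sentence on its own reproduces its column from the batched call:

```julia
# Column n of the batched result matches running sentence n alone,
# so the sentences in a batch do not interact.
h0 = rand(Float32, output_size, batch_size)
h_batch, _ = rnn_cell(h0, x)
h_one, _ = rnn_cell(h0[:, 1], x[:, 1])
h_batch[:, 1] ≈ h_one  # true, up to floating-point rounding
```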
 
-In many situations, such as when dealing with a language model, each batch typically contains independent sentences, so we cannot handle the model as if each batch was the direct continuation of the previous one. To handle such situation, we need to reset the state of the model between each batch, which can be conveniently performed within the loss function:
+In many situations, such as when dealing with a language model, the sentences in each batch are independent (i.e. the last item of the first sentence of the first batch is independent of the first item of the first sentence of the second batch), so we cannot handle the model as if each batch were the direct continuation of the previous one. To handle such situations, we need to reset the state of the model between batches, which can conveniently be done within the loss function:
 
 ```julia
 function loss(x, y)
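
The hunk is truncated at the signature; purely as an illustration (not the diff's actual continuation), a reset-inside-the-loss pattern along these lines, assuming the model `m` and `Flux.Losses.mse`, might look like:

```julia
using Flux

# Hypothetical sketch; the real body is truncated in this diff.
function loss(x, y)
    Flux.reset!(m)  # start each batch from a fresh hidden state
    return sum(Flux.Losses.mse(m(xi), yi) for (xi, yi) in zip(x, y))
end
```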
