From 8924148d48ca01e3eaef8a81c2fe731a4e70920d Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Tue, 4 Oct 2022 23:14:12 -0400
Subject: [PATCH 1/6] re-organise built-in layers section

---
 docs/src/models/layers.md | 106 +++++++++++++++++++++++++++-----------
 src/layers/basic.jl       |   3 ++
 2 files changed, 78 insertions(+), 31 deletions(-)

diff --git a/docs/src/models/layers.md b/docs/src/models/layers.md
index 6230637744..80e7fcc5fb 100644
--- a/docs/src/models/layers.md
+++ b/docs/src/models/layers.md
@@ -1,86 +1,130 @@
-# Basic Layers
+# Built-in Layer Types
 
-These core layers form the foundation of almost all neural networks.
+If you started at the beginning, then you have already met the basic [`Dense`](@ref) layer, and seen [`Chain`](@ref) for combining layers. These core layers form the foundation of almost all neural networks.
+
+The `Dense` layer
+
+* Weight matrices are created ... Many layers take an `init` keyword, accepts a function acting like `rand`. That is, `init(2,3,4)` creates an array of this size. ... always on the CPU.
+
+* An activation function. This is broadcast over the output: `Flux.Scale(3, tanh)([1,2,3]) ≈ tanh.(1:3)`
+
+* The bias vector is always initialised `Flux.zeros32`. The keyword `bias=false` will turn this off.
+
+
+* All layers are annotated with `@layer`, which means that `params` will see the contents, and `gpu` will move their arrays to the GPU.
+
+
+## Fully Connected
 
 ```@docs
-Chain
 Dense
+Flux.Bilinear
+Flux.Scale
 ```
 
-## Convolution and Pooling Layers
+Perhaps `Scale` isn't quite fully connected, but it may be thought of as `Dense(Diagonal(s.weights), s.bias)`, and LinearAlgebra's `Diagonal` is a matrix which just happens to contain many zeros.
+
+## Convolution Models
 
 These layers are used to build convolutional neural networks (CNNs).
 
+They all expect images in what is called WHCN order: a batch of 32 colour images, each 50 x 50 pixels, will have `size(x) == (50, 50, 3, 32)`. A single grayscale image might instead have `size(x) == (28, 28, 1, 1)`.
+
+Besides images (2D data), they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have `size(x) == (1000, 2, 1)`. They will also work with 3D data, `ndims(x) == 5`, where again the last two dimensions are channel and batch.
+
+To understand how `stride` ?? there's a cute article.
+
 ```@docs
 Conv
 Conv(weight::AbstractArray)
-AdaptiveMaxPool
-MaxPool
-GlobalMaxPool
-AdaptiveMeanPool
-MeanPool
-GlobalMeanPool
-DepthwiseConv
 ConvTranspose
 ConvTranspose(weight::AbstractArray)
 CrossCor
 CrossCor(weight::AbstractArray)
+DepthwiseConv
 SamePad
 Flux.flatten
 ```
 
-## Upsampling Layers
+### Pooling
+
+These layers are commonly used after a convolution layer, and reduce the size of its output. They have no trainable parameters.
+
+```@docs
+AdaptiveMaxPool
+MaxPool
+GlobalMaxPool
+AdaptiveMeanPool
+MeanPool
+GlobalMeanPool
+```
+
+## Upsampling
+
+The opposite of pooling, these layers increase the size of an array. They have no trainable parameters.
 
 ```@docs
 Upsample
 PixelShuffle
 ```
 
-## Recurrent Layers
+## Embedding Vectors
 
-Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).
+These layers accept an index, and return a vector (or several indices, and several vectors). The possible embedding vectors are learned parameters.
 
 ```@docs
-RNN
-LSTM
-GRU
-GRUv3
-Flux.Recur
-Flux.reset!
+Flux.Embedding
+Flux.EmbeddingBag
 ```
 
-## Other General Purpose Layers
+## Dataflow Layers, or Containers
 
-These are marginally more obscure than the Basic Layers.
-But in contrast to the layers described in the other sections are not readily grouped around a particular purpose (e.g. CNNs or RNNs).
+The basic `Chain(F, G, H)` applies the layers it contains in sequence, equivalent to `H ∘ G ∘ F`. Flux has some other layers which contain layers, but connect them up in a more complicated way: `SkipConnection` allows ResNet's ??residual connection.
+
+These are all defined with [`@layer`](@ref)` :exand TypeName`, which tells the pretty-printing code that they contain other layers.
 
 ```@docs
+Chain
+Flux.activations
 Maxout
 SkipConnection
 Parallel
-Flux.Bilinear
-Flux.Scale
-Flux.Embedding
+PairwiseFusion
+```
+
+## Recurrent Models
+
+Much like the core layers above, but can be used to process sequence data (as well as other kinds of structured data).
+
+```@docs
+RNN
+LSTM
+GRU
+GRUv3
+Flux.Recur
+Flux.reset!
 ```
 
 ## Normalisation & Regularisation
 
-These layers don't affect the structure of the network but may improve training times or reduce overfitting.
+These layers don't affect the structure of the network but may improve training times or reduce overfitting. Some of them contain trainable parameters, while others do not.
 
 ```@docs
-Flux.normalise
 BatchNorm
 Dropout
-Flux.dropout
 AlphaDropout
 LayerNorm
 InstanceNorm
 GroupNorm
+Flux.normalise
+Flux.dropout
 ```
 
-### Testmode
+### Test vs. Train
+
+Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.
 
-Many normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference. Still, depending on your use case, it may be helpful to manually specify when these layers should be treated as being trained or not. For this, Flux provides `Flux.testmode!`. When called on a model (e.g. a layer or chain of layers), this function will place the model into the mode specified.
+The functions `Flux.trainmode!` and `Flux.testmode!` let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.
 
 ```@docs
 Flux.testmode!
diff --git a/src/layers/basic.jl b/src/layers/basic.jl
index 2a3bc9131c..9524c0c284 100644
--- a/src/layers/basic.jl
+++ b/src/layers/basic.jl
@@ -182,6 +182,9 @@ function Base.show(io::IO, l::Dense)
   print(io, ")")
 end
 
+Dense(W::LinearAlgebra.Diagonal, bias = true, σ = identity) =
+  Scale(W.diag, bias, σ)
+
 """
     Scale(size::Integer..., σ=identity; bias=true, init=ones32)
    Scale(scale::AbstractArray, [bias, σ])

From e0971b4720752f2635a7cd973a57f52f809fd2bd Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Sat, 19 Nov 2022 16:40:33 -0500
Subject: [PATCH 2/6] fixup

---
 docs/src/models/layers.md | 23 +++++++++++++----------
 docs/src/utilities.md     |  2 +-
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/docs/src/models/layers.md b/docs/src/models/layers.md
index 80e7fcc5fb..08874236e7 100644
--- a/docs/src/models/layers.md
+++ b/docs/src/models/layers.md
@@ -1,18 +1,21 @@
 # Built-in Layer Types
 
-If you started at the beginning, then you have already met the basic [`Dense`](@ref) layer, and seen [`Chain`](@ref) for combining layers. These core layers form the foundation of almost all neural networks.
+If you started at the beginning of the guide, then you have already met the
+basic [`Dense`](@ref) layer, and seen [`Chain`](@ref) for combining layers.
+These core layers form the foundation of almost all neural networks.
 
-The `Dense` layer
+The `Dense` exemplifies several features:
 
-* Weight matrices are created ... Many layers take an `init` keyword, accepts a function acting like `rand`. That is, `init(2,3,4)` creates an array of this size. ... always on the CPU.
+* It contains an [activation function](@ref man-activations), which is broadcasted over the output. Because this broadcast can be fused with other operations, doing so is more efficient than applying the activation function separately.
 
-* An activation function. This is broadcast over the output: `Flux.Scale(3, tanh)([1,2,3]) ≈ tanh.(1:3)`
+* It takes an `init` keyword, which accepts a function acting like `rand`. That is, `init(2,3,4)` should create an array of this size. Flux has [many such functions](@ref man-init-funcs) built-in. All make a CPU array, moved later with [gpu](@ref Flux.gpu) if desired.
 
-* The bias vector is always initialised `Flux.zeros32`. The keyword `bias=false` will turn this off.
+* The bias vector is always initialised [`Flux.zeros32`](@ref). The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.
 
-
-* All layers are annotated with `@layer`, which means that `params` will see the contents, and `gpu` will move their arrays to the GPU.
+* It is annotated with [`@layer`](@ref Flux.@layer), which means that [`params`](@ref Flux.params) will see the contents, and [gpu](@ref Flux.gpu) will move their arrays to the GPU.
 
-
+By contrast, `Chain` itself contains no parameters, but connects other layers together.
+The section on [dataflow layers](@ref man-dataflow-layers) introduces others like this.
+
 ## Fully Connected
@@ -32,7 +35,7 @@ They all expect images in what is called WHCN order: a batch of 32 colour images
 
 Besides images (2D data), they also work with 1D data, where for instance a stereo sound recording with 1000 samples might have `size(x) == (1000, 2, 1)`. They will also work with 3D data, `ndims(x) == 5`, where again the last two dimensions are channel and batch.
 
-To understand how `stride` ?? there's a cute article.
+To understand how strides and padding work, the article by [Dumoulin & Visin](https://arxiv.org/abs/1603.07285) has great illustrations.
 
 ```@docs
 Conv
@@ -77,9 +80,9 @@ Flux.Embedding
 Flux.EmbeddingBag
 ```
 
-## Dataflow Layers, or Containers
+## [Dataflow Layers, or Containers](@id man-dataflow-layers)
 
-The basic `Chain(F, G, H)` applies the layers it contains in sequence, equivalent to `H ∘ G ∘ F`. Flux has some other layers which contain layers, but connect them up in a more complicated way: `SkipConnection` allows ResNet's ??residual connection.
+The basic `Chain(F, G, H)` applies the layers it contains in sequence, equivalent to `H ∘ G ∘ F`. Flux has some other layers which contain layers, but connect them up in a more complicated way: `SkipConnection` allows ResNet's residual connection.
 
 These are all defined with [`@layer`](@ref)` :exand TypeName`, which tells the pretty-printing code that they contain other layers.
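
(Illustrative aside, not part of the patch series: the `Dense` bullets rewritten above can be checked at the REPL. A minimal sketch assuming a contemporary Flux release; `Dense`, its `init` and `bias` keywords, and `Flux.glorot_uniform` are standard API, while the variable names are ours.)

```julia
using Flux

# `init` accepts any rand-like function; for Dense it is called as init(out, in),
# and always produces a plain CPU array:
layer = Dense(2 => 3, tanh; init=Flux.glorot_uniform)
size(layer.weight)              # (3, 2)

# The bias vector starts at zero, and `bias=false` removes it entirely:
layer.bias == Flux.zeros32(3)   # true
Dense(2 => 3; bias=false).bias  # false

# The activation function is broadcast over the output:
x = Float32[1, 2]
layer(x) ≈ tanh.(layer.weight * x .+ layer.bias)  # true
```
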
diff --git a/docs/src/utilities.md b/docs/src/utilities.md
index a6f963fa58..e41e0c5e51 100644
--- a/docs/src/utilities.md
+++ b/docs/src/utilities.md
@@ -1,4 +1,4 @@
-# Random Weight Initialisation
+# [Random Weight Initialisation](@id man-init-funcs)
 
 Flux initialises convolutional layers and recurrent cells with `glorot_uniform` by default. Most layers accept a function as an `init` keyword, which replaces this default. For example:

From 19f7ee7accc1815f34262bc73a7621daf7b1f198 Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Tue, 29 Nov 2022 11:31:36 -0500
Subject: [PATCH 3/6] move create_bias somewhere more logical

---
 docs/src/models/basics.md | 1 -
 docs/src/utilities.md     | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/models/basics.md b/docs/src/models/basics.md
index e82c25185e..d1335ff229 100644
--- a/docs/src/models/basics.md
+++ b/docs/src/models/basics.md
@@ -233,5 +233,4 @@ Affine(3 => 1, bias=false, init=ones) |> gpu
 
 ```@docs
 Functors.@functor
-Flux.create_bias
 ```

diff --git a/docs/src/utilities.md b/docs/src/utilities.md
index e41e0c5e51..a9a2091f2a 100644
--- a/docs/src/utilities.md
+++ b/docs/src/utilities.md
@@ -42,6 +42,7 @@ Flux.ones32
 Flux.zeros32
 Flux.rand32
 Flux.randn32
+Flux.create_bias
 ```
 
 These functions call:

From fbfc25f27ce0211a411e4d90f5b2dbbcb4a58023 Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Tue, 29 Nov 2022 11:32:02 -0500
Subject: [PATCH 4/6] add a warning about other AD breaking automagic train mode

---
 docs/src/models/layers.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/src/models/layers.md b/docs/src/models/layers.md
index 08874236e7..bb61f0faaa 100644
--- a/docs/src/models/layers.md
+++ b/docs/src/models/layers.md
@@ -127,6 +127,11 @@ Flux.dropout
 
 Several normalisation layers behave differently under training and inference (testing). By default, Flux will automatically determine when a layer evaluation is part of training or inference.
 
+!!! warning
+    This automatic train/test detection works best with Zygote, the default
+    automatic differentiation package. It may not work with other packages
+    such as Tracker, Yota, or ForwardDiff.
+
 The functions `Flux.trainmode!` and `Flux.testmode!` let you manually specify which behaviour you want. When called on a model, they will place all layers within the model into the specified mode.
 
 ```@docs
 Flux.testmode!

From 0f71aacdccaf4fa2a7a3cc9d574e0ca924a55891 Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Tue, 29 Nov 2022 11:33:39 -0500
Subject: [PATCH 5/6] remove mention of at-layer macro for now

---
 docs/src/models/layers.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docs/src/models/layers.md b/docs/src/models/layers.md
index bb61f0faaa..40f8f41b6d 100644
--- a/docs/src/models/layers.md
+++ b/docs/src/models/layers.md
@@ -12,7 +12,7 @@ The `Dense` exemplifies several features:
 
 * The bias vector is always initialised [`Flux.zeros32`](@ref). The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.
 
-* It is annotated with [`@layer`](@ref Flux.@layer), which means that [`params`](@ref Flux.params) will see the contents, and [gpu](@ref Flux.gpu) will move their arrays to the GPU.
+* It is annotated with [`@functor`](@ref Functors.@functor), which means that [`params`](@ref Flux.params) will see the contents, and [gpu](@ref Flux.gpu) will move their arrays to the GPU.
 By contrast, `Chain` itself contains no parameters, but connects other layers together.
 The section on [dataflow layers](@ref man-dataflow-layers) introduces others like this.
@@ -84,8 +84,6 @@ Flux.EmbeddingBag
 ```
 
 The basic `Chain(F, G, H)` applies the layers it contains in sequence, equivalent to `H ∘ G ∘ F`. Flux has some other layers which contain layers, but connect them up in a more complicated way: `SkipConnection` allows ResNet's residual connection.
 
-These are all defined with [`@layer`](@ref)` :exand TypeName`, which tells the pretty-printing code that they contain other layers.
-
 ```@docs
 Chain
 Flux.activations

From fb0757e9edf4784f4ac3862c3a9a6ac132fcfa21 Mon Sep 17 00:00:00 2001
From: Michael Abbott <32575566+mcabbott@users.noreply.github.com>
Date: Wed, 30 Nov 2022 08:21:34 -0500
Subject: [PATCH 6/6] fix some links

Co-authored-by: Saransh Chopra
---
 docs/src/models/layers.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/src/models/layers.md b/docs/src/models/layers.md
index 40f8f41b6d..3714f434e4 100644
--- a/docs/src/models/layers.md
+++ b/docs/src/models/layers.md
@@ -8,11 +8,11 @@ The `Dense` exemplifies several features:
 
 * It contains an [activation function](@ref man-activations), which is broadcasted over the output. Because this broadcast can be fused with other operations, doing so is more efficient than applying the activation function separately.
 
-* It takes an `init` keyword, which accepts a function acting like `rand`. That is, `init(2,3,4)` should create an array of this size. Flux has [many such functions](@ref man-init-funcs) built-in. All make a CPU array, moved later with [gpu](@ref Flux.gpu) if desired.
+* It takes an `init` keyword, which accepts a function acting like `rand`. That is, `init(2,3,4)` should create an array of this size. Flux has [many such functions](@ref man-init-funcs) built-in. All make a CPU array, moved later with [`gpu`](@ref Flux.gpu) if desired.
 
 * The bias vector is always initialised [`Flux.zeros32`](@ref). The keyword `bias=false` will turn this off, i.e. keeping the bias permanently zero.
 
-* It is annotated with [`@functor`](@ref Functors.@functor), which means that [`params`](@ref Flux.params) will see the contents, and [gpu](@ref Flux.gpu) will move their arrays to the GPU.
+* It is annotated with [`@functor`](@ref Functors.@functor), which means that [`params`](@ref Flux.params) will see the contents, and [`gpu`](@ref Flux.gpu) will move their arrays to the GPU.
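
(Illustrative aside, not part of the patch series: the docs above describe `Flux.Scale` as behaving like `Dense` with a diagonal weight matrix, and PATCH 1/6 adds a `Dense(::Diagonal)` method in `src/layers/basic.jl` that forwards to `Scale`. A minimal sketch of both claims, assuming that patch is applied and using only the documented `Scale(scale, [bias, σ])` form.)

```julia
using Flux, LinearAlgebra

# With PATCH 1/6 applied, a Dense layer built from a Diagonal weight matrix
# collapses to the equivalent element-wise Scale layer:
d = Dense(Diagonal(Float32[1, 2, 3]))
d isa Flux.Scale     # true
d(Float32[1, 1, 1])  # Float32[1, 2, 3], since the bias starts at zero

# Conversely, Scale acts like Dense with a diagonal weight matrix:
s = Flux.Scale(Float32[1, 2, 3], false, tanh)
s(Float32[4, 5, 6]) ≈ tanh.(Diagonal(Float32[1, 2, 3]) * Float32[4, 5, 6])  # true
```
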