Commit 33f99ef

bors[bot] and darsnack authored

Merge #1462

1462: Add Parallel layer r=DhairyaLGandhi a=darsnack

Since #1289 stalled, I have added an implementation of `Parallel` with some of the changes we discussed during ML calls. This version excludes most of the structural layers in #1289, like `Join`, `Split`, and `Nop`. I also added the ability for the user to specify the reduction operator. If it is acceptable, I would like to remap `SkipConnection` to `Parallel` (not a deprecation exactly).

The reason for submitting this PR now is because I am creating pre-trained weights for the networks in FluxML/Metalhead.jl#70, and there is a lot of code that can be replaced with a `Parallel`. So, I'd like to have `Parallel` in Flux before continuing with training to make the process easier.

### PR Checklist

- [x] Tests are added
- [x] Entry in NEWS.md
- [x] Documentation, if applicable
- [x] Final review from @DhairyaLGandhi (for API changes).

cc @CarloLucibello

Co-authored-by: Kyle Daruwalla <daruwalla@wisc.edu>
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>

2 parents 40db137 + 377c977 commit 33f99ef
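The commit message mentions remapping `SkipConnection` onto `Parallel`. As a minimal sketch of why that is possible (not code from this diff; the layer size is made up), `Parallel` with `identity` as its second path computes the same thing as `SkipConnection`:

```julia
using Flux

# SkipConnection(layer, connection)(x) computes connection(layer(x), x), while
# Parallel(connection, layer, identity)(x) computes connection(layer(x), identity(x)),
# i.e. the same quantity, which is what makes the remapping plausible.
layer = Dense(4, 4)
x = rand(Float32, 4)

skip = SkipConnection(layer, +)
par  = Parallel(+, layer, identity)

skip(x) ≈ par(x)  # true
```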

File tree

7 files changed: +210 / -5 lines

NEWS.md

Lines changed: 1 addition & 0 deletions

@@ -8,6 +8,7 @@
 * Removed kwarg only constructors for [`convolutional layers`](https://github.com/FluxML/Flux.jl/pull/1379).
 * Add [sparse initialization](https://github.com/FluxML/Flux.jl/pull/1454) as described in [Deep learning via Hessian-free optimization](https://dl.acm.org/doi/abs/10.5555/3104322.3104416).
 * Moved GPU CI to use buildkite instead of GitLab
+* New [`Parallel` layer](https://github.com/FluxML/Flux.jl/pull/1462) adds inception module-like building blocks.
 * Other new features and bug fixes (see GitHub releases page)
 
 ## v0.11.2

docs/src/models/advanced.md

Lines changed: 133 additions & 0 deletions

@@ -70,3 +70,136 @@ by simply deleting it from `ps`:
 ps = params(m)
 delete!(ps, m[2].b)
 ```
+
+## Custom multiple input or output layer
+
+Sometimes a model needs to receive several separate inputs at once or produce several separate outputs at once. In other words, there are multiple paths within this high-level layer, each processing a different input or producing a different output. A simple example of this in the machine learning literature is the [inception module](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf).
+
+Naively, we could have a struct that stores the weights along each path and implement the joining/splitting in the forward pass function. But that would mean a new struct any time the operations along each path change. Instead, this guide will show you how to construct a high-level layer (like [`Chain`](@ref)) that is made of multiple sub-layers for each path.
+
+### Multiple inputs: a custom `Join` layer
+
+Our custom `Join` layer will accept multiple inputs at once, pass each input through a separate path, then combine the results together. Note that this layer can already be constructed using [`Parallel`](@ref), but we will first walk through how to do this manually.
+
+We start by defining a new struct, `Join`, that stores the different paths and a combine operation as its fields.
+```julia
+using Flux
+using CUDA
+
+# custom join layer
+struct Join{T, F}
+  combine::F
+  paths::T
+end
+
+# allow Join(op, m1, m2, ...) as a constructor
+Join(combine, paths...) = Join(combine, paths)
+```
+Notice that we parameterized the type of the `paths` field. This is necessary for fast Julia code; in general, `T` might be a `Tuple` or `Vector`, but we don't need to pay attention to what it specifically is. The same goes for the `combine` field.
+
+The next step is to use [`Flux.@functor`](@ref) to make our struct behave like a Flux layer. This is important so that calling `params` on a `Join` returns the underlying weight arrays on each path.
+```julia
+Flux.@functor Join
+```
+
+Finally, we define the forward pass. For `Join`, this means applying each layer in `paths` to its corresponding input array, then using `combine` to merge the results.
+```julia
+(m::Join)(xs::Tuple) = m.combine(map((f, x) -> f(x), m.paths, xs)...)
+(m::Join)(xs...) = m(xs)
+```
+Lastly, we can test our new layer. Thanks to the proper abstractions in Julia, our layer works on GPU arrays out of the box!
+```julia
+model = Chain(
+          Join(vcat,
+            Chain(
+              Dense(1, 5),
+              Dense(5, 1)
+            ),
+            Dense(1, 2),
+            Dense(1, 1),
+          ),
+          Dense(4, 1)
+        ) |> gpu
+
+xs = map(gpu, (rand(1), rand(1), rand(1)))
+
+model(xs)
+# returns a single float vector with one value
+```
+
+#### Using `Parallel`
+
+Flux already provides [`Parallel`](@ref), which offers the same functionality. In this case, `Join` is just syntactic sugar for `Parallel`.
+```julia
+Join(combine, paths) = Parallel(combine, paths)
+Join(combine, paths...) = Join(combine, paths)
+
+# use the vararg/tuple version of the Parallel forward pass
+model = Chain(
+          Join(vcat,
+            Chain(
+              Dense(1, 5),
+              Dense(5, 1)
+            ),
+            Dense(1, 2),
+            Dense(1, 1),
+          ),
+          Dense(4, 1)
+        ) |> gpu
+
+xs = map(gpu, (rand(1), rand(1), rand(1)))
+
+model(xs)
+# returns a single float vector with one value
+```
+
+### Multiple outputs: a custom `Split` layer
+
+Our custom `Split` layer will accept a single input, then pass it through several separate paths to produce multiple outputs.
+
+We start by following the same steps as the `Join` layer: define a struct, use [`Flux.@functor`](@ref), and define the forward pass.
+```julia
+using Flux
+using CUDA
+
+# custom split layer
+struct Split{T}
+  paths::T
+end
+
+Split(paths...) = Split(paths)
+
+Flux.@functor Split
+
+(m::Split)(x::AbstractArray) = tuple(map(f -> f(x), m.paths)...)
+```
+
+Now we can test to see that our `Split` does indeed produce multiple outputs.
+```julia
+model = Chain(
+          Dense(10, 5),
+          Split(
+            Dense(5, 1),
+            Dense(5, 3),
+            Dense(5, 2)
+          )
+        ) |> gpu
+
+model(gpu(rand(10)))
+# returns a tuple with three float vectors
+```
+
+A custom loss function for the multiple outputs may look like this:
+```julia
+using Statistics
+
+# assuming model returns the output of a Split
+# x is a single input
+# ys is a tuple of outputs
+function loss(x, ys, model)
+  # take the RMS over all the individual MSE losses
+  ŷs = model(x)
+  return sqrt(mean(Flux.mse(y, ŷ) for (y, ŷ) in zip(ys, ŷs)))
+end
+```
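As a follow-up to the loss function above, here is a minimal, hypothetical sketch (not part of the committed diff) of plugging it into Flux's gradient machinery, assuming the `model` and output sizes from the `Split` example:

```julia
using Flux

# Hypothetical usage of loss(x, ys, model) from the documentation snippet above.
# `model` is assumed to be the Chain ending in a Split with outputs of sizes 1, 3 and 2.
# (Move x and ys with `gpu` if the model lives on the GPU.)
x  = rand(Float32, 10)
ys = (rand(Float32, 1), rand(Float32, 3), rand(Float32, 2))

ps = Flux.params(model)
gs = gradient(() -> loss(x, ys, model), ps)   # gradients w.r.t. all path weights
Flux.Optimise.update!(Descent(0.1), ps, gs)   # one optimisation step
```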

docs/src/models/layers.md

Lines changed: 1 addition & 0 deletions

@@ -49,6 +49,7 @@ But in contrast to the layers described in the other sections are not readily gr
 ```@docs
 Maxout
 SkipConnection
+Parallel
 ```
 
 ## Normalisation & Regularisation

src/Flux.jl

Lines changed: 6 additions & 4 deletions

@@ -11,10 +11,12 @@ using Zygote: Params, @adjoint, gradient, pullback, @nograd
 
 export gradient
 
-export Chain, Dense, Maxout, RNN, LSTM, GRU, SamePad, Conv, CrossCor, ConvTranspose,
-  AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool,
-  MeanPool, flatten, DepthwiseConv, Dropout, AlphaDropout, LayerNorm, BatchNorm,
-  InstanceNorm, GroupNorm, SkipConnection, params, fmap, cpu, gpu, f32, f64,
+export Chain, Dense, Maxout, SkipConnection, Parallel, flatten,
+  RNN, LSTM, GRU,
+  SamePad, Conv, CrossCor, ConvTranspose, DepthwiseConv,
+  AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, MeanPool,
+  Dropout, AlphaDropout, LayerNorm, BatchNorm, InstanceNorm, GroupNorm,
+  params, fmap, cpu, gpu, f32, f64,
   testmode!, trainmode!
 
 include("optimise/Optimise.jl")

src/layers/basic.jl

Lines changed: 48 additions & 0 deletions

@@ -253,3 +253,51 @@ end
 function Base.show(io::IO, b::SkipConnection)
   print(io, "SkipConnection(", b.layers, ", ", b.connection, ")")
 end
+
+"""
+    Parallel(connection, layers...)
+
+Create a 'Parallel' layer that passes an input array to each path in
+`layers`, reducing the output with `connection`.
+
+Called with one input `x`, this is equivalent to `reduce(connection, [l(x) for l in layers])`.
+If called with multiple inputs, they are `zip`ped with the layers, thus `Parallel(+, f, g)(x, y) = f(x) + g(y)`.
+
+# Examples
+
+```jldoctest
+julia> model = Chain(Dense(3, 5),
+                     Parallel(vcat, Dense(5, 4), Chain(Dense(5, 7), Dense(7, 4))),
+                     Dense(8, 17));
+
+julia> size(model(rand(3)))
+(17,)
+
+julia> model = Parallel(+, Dense(10, 2), Dense(5, 2))
+Parallel(+, Dense(10, 2), Dense(5, 2))
+
+julia> size(model(rand(10), rand(5)))
+(2,)
+```
+"""
+struct Parallel{F, T}
+  connection::F
+  layers::T
+end
+
+Parallel(connection, layers...) = Parallel(connection, layers)
+
+@functor Parallel
+
+(m::Parallel)(x::AbstractArray) = mapreduce(f -> f(x), m.connection, m.layers)
+(m::Parallel)(xs::Vararg{<:AbstractArray}) = mapreduce((f, x) -> f(x), m.connection, m.layers, xs)
+(m::Parallel)(xs::Tuple) = m(xs...)
+
+Base.getindex(m::Parallel, i::Integer) = m.layers[i]
+Base.getindex(m::Parallel, i::AbstractVector) = Parallel(m.connection, m.layers[i]...)
+
+function Base.show(io::IO, m::Parallel)
+  print(io, "Parallel(", m.connection, ", ")
+  join(io, m.layers, ", ")
+  print(io, ")")
+end
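A small usage sketch (not from the diff, with made-up layer sizes) exercising the new layer's vararg call and the `getindex` methods defined above:

```julia
using Flux

# Two-path Parallel that sums the branch outputs.
m = Parallel(+, Dense(10, 2), Dense(5, 2))

size(m(rand(10), rand(5)))  # (2,): inputs are zipped with the layers, outputs reduced with +
m[1]                        # indexing with an Integer returns the first branch, a Dense(10, 2)
m[1:2]                      # indexing with a range rebuilds a Parallel over the selected branches
```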

test/layers/basic.jl

Lines changed: 18 additions & 1 deletion

@@ -106,4 +106,21 @@ import Flux: activations
       @test size(SkipConnection(Dense(10,10), (a,b) -> cat(a, b, dims = 2))(input)) == (10,4)
     end
   end
-end
+
+  @testset "Parallel" begin
+    @testset "zero sum" begin
+      input = randn(10, 10, 10, 10)
+      @test Parallel(+, x -> zeros(size(x)), identity)(input) == input
+    end
+
+    @testset "concat size" begin
+      input = randn(10, 2)
+      @test size(Parallel((a, b) -> cat(a, b; dims=2), Dense(10, 10), identity)(input)) == (10, 4)
+    end
+
+    @testset "vararg input" begin
+      inputs = randn(10), randn(5), randn(4)
+      @test size(Parallel(+, Dense(10, 2), Dense(5, 2), Dense(4, 2))(inputs)) == (2,)
+    end
+  end
+end

test/outputsize.jl

Lines changed: 3 additions & 0 deletions

@@ -28,6 +28,9 @@
 
   m = SkipConnection(Conv((3, 3), 3 => 16; pad = 1), (mx, x) -> cat(mx, x; dims = 3))
   @test outputsize(m, (10, 10, 3, 1)) == (10, 10, 19, 1)
+
+  m = Parallel((mx, x) -> cat(mx, x; dims = 3), Conv((3, 3), 3 => 16; pad = 1), identity)
+  @test outputsize(m, (10, 10, 3, 1)) == (10, 10, 19, 1)
 end
 
 @testset "activations" begin
