@@ -275,3 +275,66 @@ For more details on training in the implicit style, see [Flux 0.13.6 documentati
For details about the two gradient modes, see [Zygote's documentation](https://fluxml.ai/Zygote.jl/dev/#Explicit-and-Implicit-Parameters-1).
+ ## Regularisation
+
+ The term *regularisation* covers a wide variety of techniques aiming to improve the
+ result of training. This is often done to avoid overfitting.
+
+ Some of these can be implemented by simply modifying the loss function.
+ An L2 penalty adds to the loss a penalty proportional to `θ^2` for every scalar parameter,
+ and for a simple model could be implemented as follows:
+
+ ```julia
+ Flux.gradient(model) do m
+     result = m(input)
+     penalty = sum(abs2, m.weight)/2 + sum(abs2, m.bias)/2
+     my_loss(result, label) + 0.42 * penalty
+ end
+ ```
+
+ Accessing each individual parameter array by hand won't work well for large models.
+ Instead, we can use [`Flux.params`](@ref) to collect all of them,
+ and then apply a function to each one, and sum the result:
+
+ ```julia
+ pen_l2(x::AbstractArray) = sum(abs2, x)/2
+
+ Flux.gradient(model) do m
+     result = m(input)
+     penalty = sum(pen_l2, Flux.params(m))
+     my_loss(result, label) + 0.42 * penalty
+ end
+ ```
+
+ However, the gradient of this penalty term is very simple: it is proportional to the original weights.
+ So there is a simpler way to implement exactly the same thing, by modifying the optimiser
+ instead of the loss function. This is done by replacing this:
+
+ ```julia
+ opt = Flux.setup(Adam(0.1), model)
+ ```
+
+ with this:
+
+ ```julia
+ decay_opt = Flux.setup(OptimiserChain(WeightDecay(0.42), Adam(0.1)), model)
+ ```
+
+ Flux's optimisers are really modifications applied to the gradient before using it to update
+ the parameters, and `OptimiserChain` applies two such modifications.
+ The first, [`WeightDecay`](@ref), adds `0.42` times the original parameter to the gradient,
+ matching the gradient of the penalty above (with the same, unrealistically large, constant).
+ After that, in either case, [`Adam`](@ref) computes the final update.
+
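+ As a quick check of this equivalence, the gradient of the bare L2 penalty is indeed
+ the original array (here on a small hypothetical weight vector):
+
+ ```julia
+ g = Flux.gradient(w -> sum(abs2, w)/2, [1.0, -2.0, 3.0])[1]
+ # g == [1.0, -2.0, 3.0], the same as the input
+ ```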
+ The same mechanism can be used for other purposes, such as gradient clipping with [`ClipGrad`](@ref).
+
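+ For instance, this is a sketch of clipping every gradient component to `[-1, 1]`
+ before Adam's update (the threshold `1.0` is an arbitrary choice for illustration):
+
+ ```julia
+ clip_opt = Flux.setup(OptimiserChain(ClipGrad(1.0), Adam(0.1)), model)
+ ```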
+ Besides L2 / weight decay, another common and quite different kind of regularisation is
+ provided by the [`Dropout`](@ref Flux.Dropout) layer. This turns off some outputs of the
+ previous layer at random during each training step, which discourages the network from
+ relying too heavily on any individual feature.
+
+ Dropout should be active only during training. Flux normally detects this automatically,
+ but it can be controlled explicitly with [`Flux.testmode!`](@ref) and [`Flux.trainmode!`](@ref).
+
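+ A minimal sketch of a model using `Dropout`, with hypothetical layer sizes:
+
+ ```julia
+ model = Chain(
+     Dense(2 => 10, relu),
+     Dropout(0.5),       # during training, sets each activation to zero with probability 0.5
+     Dense(10 => 1),
+ )
+
+ Flux.testmode!(model)   # force inference mode: Dropout does nothing
+ Flux.trainmode!(model)  # force Dropout to stay active
+ ```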
+ ## Freezing, Schedules
+
+ Two further ways to adjust training are to freeze some parameters, so that they are not
+ updated at all, and to change the optimiser's learning rate as training progresses.
+
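+ For example, assuming the recent `Flux.freeze!` / `Flux.adjust!` API, a sketch of both
+ might look like this:
+
+ ```julia
+ model = Chain(enc = Dense(2 => 3), dec = Dense(3 => 2))
+ opt = Flux.setup(Adam(0.1), model)
+
+ Flux.freeze!(opt.layers.enc)  # the encoder's parameters will no longer be updated
+ Flux.adjust!(opt, 0.01)       # lower Adam's learning rate for later epochs
+ Flux.thaw!(opt.layers.enc)    # undo the freeze
+ ```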