
Commit d1cfed7

replace ADAM with Adam and its variants thereof
1 parent 0b01b77 commit d1cfed7

7 files changed: +76 -76 lines changed

docs/src/models/recurrence.md

Lines changed: 1 addition & 1 deletion

@@ -173,7 +173,7 @@ Flux.reset!(m)
 [m(x) for x in seq_init]
 
 ps = Flux.params(m)
-opt= ADAM(1e-3)
+opt= Adam(1e-3)
 Flux.train!(loss, ps, data, opt)
 ```

docs/src/saving.md

Lines changed: 1 addition & 1 deletion

@@ -135,6 +135,6 @@ You can store the optimiser state alongside the model, to resume training
 exactly where you left off. BSON is smart enough to [cache values](https://github.com/JuliaIO/BSON.jl/blob/v0.3.4/src/write.jl#L71) and insert links when saving, but only if it knows everything to be saved up front. Thus models and optimizers must be saved together to have the latter work after restoring.
 
 ```julia
-opt = ADAM()
+opt = Adam()
 @save "model-$(now()).bson" model opt
 ```
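
For context, the counterpart to the `@save` call above is BSON's `@load`. A minimal sketch, not part of the commit, where the checkpoint file name and the `loss`/`data` objects are placeholders:

```julia
using Flux
using BSON: @load

# "model-checkpoint.bson" stands in for whatever "model-$(now()).bson" expanded to.
@load "model-checkpoint.bson" model opt   # restores the model and its Adam state together

ps = Flux.params(model)
Flux.train!(loss, ps, data, opt)          # resume training where the checkpoint left off
```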

docs/src/training/optimisers.md

Lines changed: 9 additions & 9 deletions

@@ -39,7 +39,7 @@ for p in (W, b)
 end
 ```
 
-An optimiser `update!` accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass `opt` to our [training loop](training.md), which will update all parameters of the model in a loop. However, we can now easily replace `Descent` with a more advanced optimiser such as `ADAM`.
+An optimiser `update!` accepts a parameter and a gradient, and updates the parameter according to the chosen rule. We can also pass `opt` to our [training loop](training.md), which will update all parameters of the model in a loop. However, we can now easily replace `Descent` with a more advanced optimiser such as `Adam`.
 
 ## Optimiser Reference
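
As a quick illustration of the paragraph above (illustrative only, not part of the commit), here is that `update!` pattern with the renamed optimiser; `W`, `b`, `x`, `y`, and the loss are toy placeholders:

```julia
using Flux

W, b = rand(2, 3), rand(2)
x, y = rand(3), rand(2)
loss(x, y) = sum((W*x .+ b .- y).^2)

opt = Adam(1e-3)   # a drop-in replacement for Descent here
gs = gradient(() -> loss(x, y), Flux.params(W, b))
for p in (W, b)
  Flux.update!(opt, p, gs[p])   # applies Adam's update rule to p in place
end
```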

@@ -51,15 +51,15 @@ Descent
 Momentum
 Nesterov
 RMSProp
-ADAM
-RADAM
+Adam
+RAdam
 AdaMax
-ADAGrad
-ADADelta
+AdaGrad
+AdaDelta
 AMSGrad
-NADAM
-ADAMW
-OADAM
+NAdam
+AdamW
+OAdam
 AdaBelief
 ```
 
@@ -182,7 +182,7 @@ WeightDecay
 Gradient clipping is useful for training recurrent neural networks, which have a tendency to suffer from the exploding gradient problem. An example usage is
 
 ```julia
-opt = Optimiser(ClipValue(1e-3), ADAM(1e-3))
+opt = Optimiser(ClipValue(1e-3), Adam(1e-3))
 ```
 
 ```@docs
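
A short usage sketch of the composed optimiser above (not part of the commit); the model, loss, and data are toy placeholders:

```julia
using Flux

m = Dense(10, 2)
loss(x, y) = Flux.Losses.mse(m(x), y)
data = [(rand(Float32, 10, 16), rand(Float32, 2, 16))]

opt = Optimiser(ClipValue(1e-3), Adam(1e-3))   # clip each gradient entry, then apply Adam
# Optimiser(ClipNorm(1f0), Adam(1e-3)) would clip the overall gradient norm instead.
Flux.train!(loss, Flux.params(m), data, opt)
```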

src/Flux.jl

Lines changed: 3 additions & 3 deletions

@@ -29,9 +29,9 @@ include("optimise/Optimise.jl")
 using .Optimise
 using .Optimise: @epochs
 using .Optimise: skip
-export Descent, ADAM, Momentum, Nesterov, RMSProp,
-  ADAGrad, AdaMax, ADADelta, AMSGrad, NADAM, OADAM,
-  ADAMW, RADAM, AdaBelief, InvDecay, ExpDecay,
+export Descent, Adam, Momentum, Nesterov, RMSProp,
+  AdaGrad, AdaMax, AdaDelta, AMSGrad, NAdam, OAdam,
+  AdamW, RAdam, AdaBelief, InvDecay, ExpDecay,
   WeightDecay, ClipValue, ClipNorm
 
 using CUDA

src/optimise/Optimise.jl

Lines changed: 2 additions & 2 deletions

@@ -4,8 +4,8 @@ using LinearAlgebra
 import ArrayInterface
 
 export train!, update!,
-  Descent, ADAM, Momentum, Nesterov, RMSProp,
-  ADAGrad, AdaMax, ADADelta, AMSGrad, NADAM, ADAMW,RADAM, OADAM, AdaBelief,
+  Descent, Adam, Momentum, Nesterov, RMSProp,
+  AdaGrad, AdaMax, AdaDelta, AMSGrad, NAdam, AdamW,RAdam, OAdam, AdaBelief,
   InvDecay, ExpDecay, WeightDecay, stop, skip, Optimiser,
   ClipValue, ClipNorm

src/optimise/optimisers.jl

Lines changed: 55 additions & 55 deletions
@@ -147,9 +147,9 @@ function apply!(o::RMSProp, x, Δ)
 end
 
 """
-    ADAM(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
+    Adam(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
 
-[ADAM](https://arxiv.org/abs/1412.6980) optimiser.
+[Adam](https://arxiv.org/abs/1412.6980) optimiser.
 
 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
@@ -159,21 +159,21 @@ end
 
 # Examples
 ```julia
-opt = ADAM()
+opt = Adam()
 
-opt = ADAM(0.001, (0.9, 0.8))
+opt = Adam(0.001, (0.9, 0.8))
 ```
 """
-mutable struct ADAM <: AbstractOptimiser
+mutable struct Adam <: AbstractOptimiser
   eta::Float64
   beta::Tuple{Float64,Float64}
   epsilon::Float64
   state::IdDict{Any, Any}
 end
-ADAM(η::Real = 0.001, β::Tuple = (0.9, 0.999), ϵ::Real = EPS) = ADAM(η, β, ϵ, IdDict())
-ADAM(η::Real, β::Tuple, state::IdDict) = ADAM(η, β, EPS, state)
+Adam(η::Real = 0.001, β::Tuple = (0.9, 0.999), ϵ::Real = EPS) = Adam(η, β, ϵ, IdDict())
+Adam(η::Real, β::Tuple, state::IdDict) = Adam(η, β, EPS, state)
 
-function apply!(o::ADAM, x, Δ)
+function apply!(o::Adam, x, Δ)
   η, β = o.eta, o.beta
 
   mt, vt, βp = get!(o.state, x) do
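
To make the role of the `state::IdDict` above concrete, here is a minimal sketch (not part of the commit) that calls the internal `apply!` directly; the arrays are toy placeholders:

```julia
using Flux
using Flux.Optimise: apply!

opt = Adam(1e-3)
w = rand(Float32, 3)
g = ones(Float32, 3)

step = apply!(opt, w, copy(g))   # moment estimates for w are kept in opt.state, keyed by w
w .-= step                       # apply! returns the step to subtract, as train!/update! do
```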
@@ -189,9 +189,9 @@ function apply!(o::ADAM, x, Δ)
 end
 
 """
-    RADAM(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
+    RAdam(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
 
-[Rectified ADAM](https://arxiv.org/abs/1908.03265) optimizer.
+[Rectified Adam](https://arxiv.org/abs/1908.03265) optimizer.
 
 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
@@ -201,21 +201,21 @@ end
 
 # Examples
 ```julia
-opt = RADAM()
+opt = RAdam()
 
-opt = RADAM(0.001, (0.9, 0.8))
+opt = RAdam(0.001, (0.9, 0.8))
 ```
 """
-mutable struct RADAM <: AbstractOptimiser
+mutable struct RAdam <: AbstractOptimiser
  eta::Float64
  beta::Tuple{Float64,Float64}
  epsilon::Float64
  state::IdDict{Any, Any}
 end
-RADAM(η::Real = 0.001, β::Tuple = (0.9, 0.999), ϵ::Real = EPS) = RADAM(η, β, ϵ, IdDict())
-RADAM(η::Real, β::Tuple, state::IdDict) = RADAM(η, β, EPS, state)
+RAdam(η::Real = 0.001, β::Tuple = (0.9, 0.999), ϵ::Real = EPS) = RAdam(η, β, ϵ, IdDict())
+RAdam(η::Real, β::Tuple, state::IdDict) = RAdam(η, β, EPS, state)
 
-function apply!(o::RADAM, x, Δ)
+function apply!(o::RAdam, x, Δ)
  η, β = o.eta, o.beta
  ρ∞ = 2/(1-β[2])-1
 
@@ -241,7 +241,7 @@ end
 """
     AdaMax(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
 
-[AdaMax](https://arxiv.org/abs/1412.6980) is a variant of ADAM based on the ∞-norm.
+[AdaMax](https://arxiv.org/abs/1412.6980) is a variant of Adam based on the ∞-norm.
 
 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
@@ -281,10 +281,10 @@ function apply!(o::AdaMax, x, Δ)
 end
 
 """
-    OADAM(η = 0.0001, β::Tuple = (0.5, 0.9), ϵ = $EPS)
+    OAdam(η = 0.0001, β::Tuple = (0.5, 0.9), ϵ = $EPS)
 
-[OADAM](https://arxiv.org/abs/1711.00141) (Optimistic ADAM)
-is a variant of ADAM adding an "optimistic" term suitable for adversarial training.
+[OAdam](https://arxiv.org/abs/1711.00141) (Optimistic Adam)
+is a variant of Adam adding an "optimistic" term suitable for adversarial training.
 
 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
@@ -294,21 +294,21 @@ is a variant of ADAM adding an "optimistic" term suitable for adversarial traini
 
 # Examples
 ```julia
-opt = OADAM()
+opt = OAdam()
 
-opt = OADAM(0.001, (0.9, 0.995))
+opt = OAdam(0.001, (0.9, 0.995))
 ```
 """
-mutable struct OADAM <: AbstractOptimiser
+mutable struct OAdam <: AbstractOptimiser
  eta::Float64
  beta::Tuple{Float64,Float64}
  epsilon::Float64
 state::IdDict{Any, Any}
 end
-OADAM(η::Real = 0.001, β::Tuple = (0.5, 0.9), ϵ::Real = EPS) = OADAM(η, β, ϵ, IdDict())
-OADAM(η::Real, β::Tuple, state::IdDict) = RMSProp(η, β, EPS, state)
+OAdam(η::Real = 0.001, β::Tuple = (0.5, 0.9), ϵ::Real = EPS) = OAdam(η, β, ϵ, IdDict())
+OAdam(η::Real, β::Tuple, state::IdDict) = RMSProp(η, β, EPS, state)
 
-function apply!(o::OADAM, x, Δ)
+function apply!(o::OAdam, x, Δ)
  η, β = o.eta, o.beta
 
  mt, vt, Δ_, βp = get!(o.state, x) do
@@ -326,9 +326,9 @@ function apply!(o::OADAM, x, Δ)
 end
 
 """
-    ADAGrad(η = 0.1, ϵ = $EPS)
+    AdaGrad(η = 0.1, ϵ = $EPS)
 
-[ADAGrad](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) optimizer. It has
+[AdaGrad](http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) optimizer. It has
 parameter specific learning rates based on how frequently it is updated.
 Parameters don't need tuning.
 
@@ -338,20 +338,20 @@ Parameters don't need tuning.
 
 # Examples
 ```julia
-opt = ADAGrad()
+opt = AdaGrad()
 
-opt = ADAGrad(0.001)
+opt = AdaGrad(0.001)
 ```
 """
-mutable struct ADAGrad <: AbstractOptimiser
+mutable struct AdaGrad <: AbstractOptimiser
  eta::Float64
  epsilon::Float64
 acc::IdDict
 end
-ADAGrad(η::Real = 0.1, ϵ::Real = EPS) = ADAGrad(η, ϵ, IdDict())
-ADAGrad(η::Real, state::IdDict) = ADAGrad(η, EPS, state)
+AdaGrad(η::Real = 0.1, ϵ::Real = EPS) = AdaGrad(η, ϵ, IdDict())
+AdaGrad(η::Real, state::IdDict) = AdaGrad(η, EPS, state)
 
-function apply!(o::ADAGrad, x, Δ)
+function apply!(o::AdaGrad, x, Δ)
  η = o.eta
  acc = get!(() -> fill!(similar(x), o.epsilon), o.acc, x)::typeof(x)
  @. acc += Δ * conj(Δ)
@@ -361,7 +361,7 @@ end
 """
     ADADelta(ρ = 0.9, ϵ = $EPS)
 
-[ADADelta](https://arxiv.org/abs/1212.5701) is a version of ADAGrad adapting its learning
+[ADADelta](https://arxiv.org/abs/1212.5701) is a version of AdaGrad adapting its learning
 rate based on a window of past gradient updates.
 Parameters don't need tuning.
 
@@ -397,7 +397,7 @@ end
 """
     AMSGrad(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
 
-The [AMSGrad](https://openreview.net/forum?id=ryQu7f-RZ) version of the ADAM
+The [AMSGrad](https://openreview.net/forum?id=ryQu7f-RZ) version of the Adam
 optimiser. Parameters don't need tuning.
 
 # Parameters
@@ -436,9 +436,9 @@ function apply!(o::AMSGrad, x, Δ)
 end
 
 """
-    NADAM(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
+    NAdam(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
 
-[NADAM](https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ) is a Nesterov variant of ADAM.
+[NAdam](https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ) is a Nesterov variant of Adam.
 Parameters don't need tuning.
 
 # Parameters
@@ -449,21 +449,21 @@ Parameters don't need tuning.
 
 # Examples
 ```julia
-opt = NADAM()
+opt = NAdam()
 
-opt = NADAM(0.002, (0.89, 0.995))
+opt = NAdam(0.002, (0.89, 0.995))
 ```
 """
-mutable struct NADAM <: AbstractOptimiser
+mutable struct NAdam <: AbstractOptimiser
  eta::Float64
  beta::Tuple{Float64, Float64}
 epsilon::Float64
 state::IdDict{Any, Any}
 end
-NADAM(η::Real = 0.001, β = (0.9, 0.999), ϵ::Real = EPS) = NADAM(η, β, ϵ, IdDict())
-NADAM(η::Real, β::Tuple, state::IdDict) = NADAM(η, β, EPS, state)
+NAdam(η::Real = 0.001, β = (0.9, 0.999), ϵ::Real = EPS) = NAdam(η, β, ϵ, IdDict())
+NAdam(η::Real, β::Tuple, state::IdDict) = NAdam(η, β, EPS, state)
 
-function apply!(o::NADAM, x, Δ)
+function apply!(o::NAdam, x, Δ)
  η, β = o.eta, o.beta
 
  mt, vt, βp = get!(o.state, x) do
@@ -480,9 +480,9 @@ function apply!(o::NADAM, x, Δ)
 end
 
 """
-    ADAMW(η = 0.001, β::Tuple = (0.9, 0.999), decay = 0)
+    AdamW(η = 0.001, β::Tuple = (0.9, 0.999), decay = 0)
 
-[ADAMW](https://arxiv.org/abs/1711.05101) is a variant of ADAM fixing (as in repairing) its
+[AdamW](https://arxiv.org/abs/1711.05101) is a variant of Adam fixing (as in repairing) its
 weight decay regularization.
 
 # Parameters
@@ -494,19 +494,19 @@ weight decay regularization.
 
 # Examples
 ```julia
-opt = ADAMW()
+opt = AdamW()
 
-opt = ADAMW(0.001, (0.89, 0.995), 0.1)
+opt = AdamW(0.001, (0.89, 0.995), 0.1)
 ```
 """
-ADAMW(η = 0.001, β = (0.9, 0.999), decay = 0) =
-  Optimiser(ADAM(η, β), WeightDecay(decay))
+AdamW(η = 0.001, β = (0.9, 0.999), decay = 0) =
+  Optimiser(Adam(η, β), WeightDecay(decay))
 
 """
     AdaBelief(η = 0.001, β::Tuple = (0.9, 0.999), ϵ = $EPS)
 
 The [AdaBelief](https://arxiv.org/abs/2010.07468) optimiser is a variant of the well-known
-ADAM optimiser.
+Adam optimiser.
 
 # Parameters
 - Learning rate (`η`): Amount by which gradients are discounted before updating
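
Per the definition in the hunk above, `AdamW` is simply `Adam` composed with `WeightDecay`. A small sketch (illustrative only, not part of the commit) of the two equivalent spellings:

```julia
using Flux

opt1 = AdamW(0.001, (0.9, 0.999), 0.01)
opt2 = Optimiser(Adam(0.001, (0.9, 0.999)), WeightDecay(0.01))   # same update as opt1
```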
@@ -537,7 +537,7 @@ function apply!(o::AdaBelief, x, Δ)
     (zero(x), zero(x), Float64[β[1], β[2]])
   end :: Tuple{typeof(x), typeof(x), Vector{Float64}}
 
-  #= st is a variance and can go to zero. This is in contrast to ADAM, which uses the
+  #= st is a variance and can go to zero. This is in contrast to Adam, which uses the
   second moment which is usually far enough from zero. This is problematic, since st
   can be slightly negative due to numerical error, and the square root below will fail.
   Also, if we want to differentiate through the optimizer, √0 is not differentiable.
@@ -643,10 +643,10 @@ for more general scheduling techniques.
 `ExpDecay` is typically composed with other optimizers
 as the last transformation of the gradient:
 ```julia
-opt = Optimiser(ADAM(), ExpDecay(1.0))
+opt = Optimiser(Adam(), ExpDecay(1.0))
 ```
 Note: you may want to start with `η=1` in `ExpDecay` when combined with other
-optimizers (`ADAM` in this case) that have their own learning rate.
+optimizers (`Adam` in this case) that have their own learning rate.
 """
 mutable struct ExpDecay <: AbstractOptimiser
   eta::Float64
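
A minimal sketch (not part of the commit) of the composition recommended in the docstring above, with `ExpDecay` started at η = 1 because Adam already carries its own learning rate:

```julia
using Flux

opt = Optimiser(Adam(1e-3), ExpDecay(1.0))   # ExpDecay applied last, scaling the step over time
```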
@@ -681,7 +681,7 @@ with coefficient ``λ`` to the loss.
 # Examples
 
 ```julia
-opt = Optimiser(WeightDecay(1f-4), ADAM())
+opt = Optimiser(WeightDecay(1f-4), Adam())
 ```
 """
 mutable struct WeightDecay <: AbstractOptimiser
