Implement DeltaBarDelta using refactored GradientDescent. #440
Conversation
Force-pushed from 7a0f03c to 808cabb
Hi,

@conradsnicta could you please approve the CI build? I can't test this with bandicoot locally.

Had to restart the GPU runner, restarted the jobs.

Thanks @zoq
Fantastic! Can you also update the HISTORY.md file?
Hi,

Done. What do you think about this? Additionally, I found this small inconsistency. Should I go ahead and fix these here, or separately?
Force-pushed from 540cdcd to 9ef2543
Thanks @zoq for the great advice. I think this is ready for review then.

It is definitely not okay to merge an optimizer that has not been thoroughly reviewed. I will get to it when I have a chance, and there is no schedule on when that will be.
Thanks for the contribution @ranjodhsingh1729. I have some concerns about the correctness of the implementation---can you look into the comments I left? Let me know if I can clarify anything, or if I missed anything. I used primarily section 6 of the paper for my review.
> for all Lagrange multipliers and 10 is used as the initial penalty parameter.
>
> `</details>`
Thanks, nice catch!
doc/optimizers.md (Outdated)
> #### See also:
> * [Increased rates of convergence through learning rate adaptation](https://www.academia.edu/download/32005051/Jacobs.NN88.pdf)
```diff
- * [Increased rates of convergence through learning rate adaptation](https://www.academia.edu/download/32005051/Jacobs.NN88.pdf)
+ * [Increased rates of convergence through learning rate adaptation (pdf)](https://www.academia.edu/download/32005051/Jacobs.NN88.pdf)
```
Most of the other links indicate to people when a PDF is going to be opened/downloaded.
Also it's probably worth linking to GradientDescent here too.
> Note that `DeltaBarDelta` is based on the templated type
> `GradientDescentType<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
> `DeltaBarDeltaUpdate` and _`DecayPolicyType`_` = NoDecay`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because DeltaBarDelta always uses the DeltaBarDeltaUpdate class, it doesn't really make sense to the user to provide a constructor that takes updatePolicy, decayPolicy, and resetPolicy. Of course an advanced user can do this if they take a look at the internals of the GradientDescent class. But for the typical user, they just want to set the settings of DeltaBarDelta and move on. So I would suggest adding a wrapper class DeltaBarDelta that just calls GradientDescentType<DeltaBarDeltaUpdate, NoDecay> internally (there are many of these for SGD variants), and also provides a constructor that forwards the parameters of DeltaBarDeltaUpdate:
`DeltaBarDelta(stepSize, maxIterations, tolerance, kappa, phi, momentum, minGain, resetPolicy)`
Then the table of options below will be much simplified too.
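For illustration, such a wrapper might look roughly like this. This is a sketch only: the `DeltaBarDeltaUpdate` constructor signature and the default values are assumptions, not the actual PR code.

```c++
// Rough sketch of the suggested wrapper; the DeltaBarDeltaUpdate constructor
// signature and the default values below are assumptions.
class DeltaBarDelta
{
 public:
  DeltaBarDelta(const double stepSize = 0.01,
                const size_t maxIterations = 100000,
                const double tolerance = 1e-5,
                const double kappa = 0.03,
                const double phi = 0.7,
                const double momentum = 0.9,
                const double minGain = 1e-6,
                const bool resetPolicy = true) :
      optimizer(stepSize, maxIterations, tolerance,
                DeltaBarDeltaUpdate(kappa, phi, momentum, minGain),
                NoDecay(), resetPolicy)
  { }

  //! Forward optimization to the wrapped GradientDescentType instance.
  template<typename FunctionType, typename MatType>
  typename MatType::elem_type Optimize(FunctionType& function, MatType& iterate)
  {
    return optimizer.Optimize(function, iterate);
  }

 private:
  //! The wrapped optimizer: DeltaBarDelta is just gradient descent with the
  //! DeltaBarDeltaUpdate policy and no step size decay.
  GradientDescentType<DeltaBarDeltaUpdate, NoDecay> optimizer;
};
```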
```c++
//! Get the step size decay policy.
const DecayPolicyType& DecayPolicy() const { return decayPolicy; }
//! Modify the step size decay policy.
DecayPolicyType& DecayPolicy() { return decayPolicy; }
```
Shouldn't there be one for ResetPolicy() too?
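Something like the following, mirroring the `DecayPolicy()` pair above (assuming the member is a `bool` named `resetPolicy`, as in the constructor below):

```c++
//! Get whether or not the update policy parameters
//! are reset before every Optimize() call.
bool ResetPolicy() const { return resetPolicy; }
//! Modify whether or not the update policy parameters
//! are reset before every Optimize() call.
bool& ResetPolicy() { return resetPolicy; }
```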
```c++
    const bool resetPolicy)
  : stepSize(stepSize), maxIterations(maxIterations), tolerance(tolerance),
    updatePolicy(updatePolicy), decayPolicy(decayPolicy),
    resetPolicy(resetPolicy), isInitialized(false)
```
```diff
-    const bool resetPolicy)
-  : stepSize(stepSize), maxIterations(maxIterations), tolerance(tolerance),
-    updatePolicy(updatePolicy), decayPolicy(decayPolicy),
-    resetPolicy(resetPolicy), isInitialized(false)
+    const bool resetPolicy) :
+    stepSize(stepSize),
+    maxIterations(maxIterations),
+    tolerance(tolerance),
+    updatePolicy(updatePolicy),
+    decayPolicy(decayPolicy),
+    resetPolicy(resetPolicy),
+    isInitialized(false)
```
Just a style fix.
```c++
  DeltaBarDelta s(0.001, 0, 1e-15);
  FunctionTest<RosenbrockFunction, TestType>(s, 0.01, 0.001);
}
```
Can you add one more test for the logistic regression test function? That is a really simple test but it is a good sanity check for machine learning applications. It should be really easy; here's the Adam logistic regression function test:
```c++
TEMPLATE_TEST_CASE("Adam_LogisticRegressionFunction", "[Adam]",
    ENS_ALL_TEST_TYPES)
{
  Adam adam(0.032);
  LogisticRegressionFunctionTest<TestType>(adam);
}
```
tests/delta_bar_delta_test.cpp (Outdated)
```diff
@@ -0,0 +1,51 @@
+/**
+ * @file delta_bar_detla_test.cpp
```
```diff
- * @file delta_bar_detla_test.cpp
+ * @file delta_bar_delta_test.cpp
```
```c++
const MatType mask = conv_to<MatType>::from((gradient % velocity) < 0.0);

gains += mask * parent.Kappa();
gains %= mask + (1 - mask) * parent.Phi();
```
Based on my read of Eq. (4) in the paper I'm not sure this is correct. The value of gains[i] should not change if (gradient % velocity)[i] is 0; but given how mask is set, I can't quickly see that this is correct.
I would suggest maybe using sign() instead to create a matrix indicating whether the element is positive, negative, or zero, and then use that matrix to apply the update rule in (4).
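For example, something along these lines (a sketch only, assuming Armadillo-style syntax; `deltaBar` is a hypothetical state variable holding the exponential average δ̄ of previous derivatives, and `Phi()` follows this PR's convention of being the multiplicative factor itself):

```c++
// Sign of the elementwise product: +1 where the current derivative agrees
// with the averaged previous derivatives, -1 where they disagree, and 0
// where either term is exactly zero.
const MatType s = arma::sign(gradient % deltaBar);

// Eq. (4): increase the gain additively by kappa where the signs agree,
// shrink it multiplicatively where they disagree, and leave it unchanged
// where the product is zero.
gains += parent.Kappa() * conv_to<MatType>::from(s > 0);
gains %= 1.0 - (1.0 - parent.Phi()) * conv_to<MatType>::from(s < 0);
```

Because neither expression touches an element where `s == 0`, the zero case of Eq. (4) is handled by construction.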
```c++
velocity *= parent.momentum;
velocity -= (stepSize * gains) % gradient;
iterate += velocity;
```
But the paper seems to indicate we update with ϵ(t + 1) in Eq. (3), not δ̄(t)? I would suggest using the same terminology as the paper throughout even if it is not intuitive what δ̄, ϵ, and θ actually mean---you can clarify that part in the comments.
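For reference, a reconstruction of the update being referred to, assembled from the symbols used in this thread (a hedged reading of Eq. (3), not a quotation of the paper):

```latex
% Reconstructed from the symbols used in this discussion (w is the weight,
% E the objective); not quoted from the paper itself.
\[
  w(t+1) = w(t) - \epsilon(t+1)\,\delta(t),
  \qquad
  \delta(t) = \frac{\partial E}{\partial w}(t),
\]
% i.e. the step uses the freshly adapted rate \epsilon(t+1) and the raw
% derivative \delta(t); the averaged \bar{\delta} enters only the gain
% adaptation, not the step itself.
```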
```c++
gains.clamp(parent.MinimumGain(), arma::datum::inf);

velocity *= parent.momentum;
velocity -= (stepSize * gains) % gradient;
```
How does this implement the exponential average δ̄(t) = (1 - θ) δ(t) + θ δ̄(t - 1)? This doesn't appear to be that.
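Concretely, maintaining that average would require its own persistent state, roughly like this (a sketch; `deltaBar` and `theta` are hypothetical names for the running average and the decay constant):

```c++
// deltaBar(t) = (1 - theta) * delta(t) + theta * deltaBar(t - 1), where
// delta(t) is the current gradient.  Eq. (4) compares the sign of
// deltaBar(t - 1) % delta(t), so the gains must be adjusted *before*
// deltaBar is updated for the current iteration.
deltaBar = (1.0 - theta) * gradient + theta * deltaBar;
```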
@rcurtin, Thanks a lot for the thorough review - turns out I've made quite a mistake here. Since both t-SNE papers (the original and the tree-based approximation) state using Jacobs's scheme and cite this paper, I assumed it was used as-is and didn't verify that assumption. Now I know that what is used in t-SNE is a modified version of the DBD rule that incorporates momentum, while the original DBD only includes momentum for 'comparison' and doesn't apply it in updates. Glad this was caught during review - definitely a reminder not to rush things like this in the future. 😅

Since there's already an open issue, PR, and review, I've gone ahead and implemented the actual delta-bar-delta rule and also addressed the review suggestions. So I guess this puts me back at square one - I'll need to implement the t-SNE update rule in mlpack and hook it up with the refactored gradient descent class. Since it's not really used elsewhere, adding it to ensmallen probably doesn't make sense. Does that sound right, or is there a cleaner way to handle it? And sorry for the confusion here.
Closes: #437

This PR:

- Adds `DeltaBarDeltaUpdate`.
- Refactors `GradientDescent` to `GradientDescentType<UpdatePolicyType, DecayPolicyType>`.
- Adds `DeltaBarDelta` as a typedef to `GradientDescentType<DeltaBarDeltaUpdate, NoDecay>`.
- `GradientDescent` is now a typedef to `GradientDescentType<VanillaUpdate, NoDecay>`.
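In outline, the type structure described above looks something like this (a sketch of the pattern, not the PR's exact code):

```c++
// Gradient descent parameterized by an update policy and a decay policy.
template<typename UpdatePolicyType = VanillaUpdate,
         typename DecayPolicyType = NoDecay>
class GradientDescentType
{
  // ... step size, tolerance, policies, Optimize(), etc. ...
};

// The existing optimizer keeps its name and default behavior:
using GradientDescent = GradientDescentType<VanillaUpdate, NoDecay>;

// The new optimizer plugs the delta-bar-delta rule into the same driver:
using DeltaBarDelta = GradientDescentType<DeltaBarDeltaUpdate, NoDecay>;
```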