
Conversation

@ranjodhsingh1729

Closes: #437

This PR:

  • Implements DeltaBarDeltaUpdate
  • Refactors GradientDescent to GradientDescentType<UpdatePolicyType, DecayPolicyType>
  • Defines DeltaBarDelta as a typedef for GradientDescentType<DeltaBarDeltaUpdate, NoDecay>

GradientDescent itself is now a typedef for GradientDescentType<VanillaUpdate, NoDecay>.
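
In code, the resulting structure looks roughly like this (a sketch only; member details are omitted and may differ from the actual code):

template<typename UpdatePolicyType = VanillaUpdate,
         typename DecayPolicyType = NoDecay>
class GradientDescentType
{
  // Holds the step size, maximum iterations, tolerance, and the update and
  // decay policy instances; Optimize() drives the iterations as before.
};

// Plain gradient descent keeps its existing name and behaviour.
typedef GradientDescentType<VanillaUpdate, NoDecay> GradientDescent;

// Delta-bar-delta is the same driver with a different update policy.
typedef GradientDescentType<DeltaBarDeltaUpdate, NoDecay> DeltaBarDelta;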

@ranjodhsingh1729 force-pushed the deltabardelta branch 2 times, most recently from 7a0f03c to 808cabb on September 20, 2025 11:24
@ranjodhsingh1729
Author

Hi,
Would it make sense to include a change here that allows the PrintLoss callback to work with GradientDescentType (since ProgressBar() would require more invasive changes)? It doesn't work currently, because PrintLoss only prints when EndEpoch is called, and GradientDescentType never calls EndEpoch.
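
For reference, what I have in mind is roughly the following, at the end of each iteration inside GradientDescentType::Optimize() (I'm assuming Callback::EndEpoch can be called with this signature, treating each iteration as an epoch; happy to adjust if that's wrong):

// After the iterate has been updated in iteration i:
terminate |= Callback::EndEpoch(*this, function, iterate, i, overallObjective,
    callbacks...);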

@ranjodhsingh1729
Author

@conradsnicta could you please approve the CI build? I can't test this with bandicoot locally.

@zoq
Member

zoq commented Oct 12, 2025

Had to restart the GPU runner, restarted the jobs.

@ranjodhsingh1729
Author

Thanks @zoq
The tests ran successfully.

Member

@zoq left a comment


Fantastic, can you also update the HISTORY.md file?

@ranjodhsingh1729
Author

Hi

> Fantastic, can you also update the HISTORY.md file?

Done.

> Hi, would it make sense to include a change here that allows the PrintLoss callback to work with GradientDescentType (since ProgressBar() would require more invasive changes)? It doesn't work currently, because PrintLoss only prints when EndEpoch is called, and GradientDescentType never calls EndEpoch.

What do you think about this?

Additionally, I found a small inconsistency: if maxIterations is non-zero, the loop performs one fewer iteration than requested.

for (size_t i = 1; i != maxIterations && !terminate; ++i)

Should I go ahead and fix these here, or separately?
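
For the loop, a possible fix that keeps maxIterations == 0 meaning "no limit" (I believe this mirrors what SGD does, but the variable name here is my own):

const size_t actualMaxIterations = (maxIterations == 0) ?
    std::numeric_limits<size_t>::max() : maxIterations;
for (size_t i = 0; i < actualMaxIterations && !terminate; ++i)
{
  // ... one gradient descent iteration ...
}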

@ranjodhsingh1729
Author

ranjodhsingh1729 commented Oct 14, 2025

> In general, fixes are orthogonal to new features, and should be kept separated.

Thanks @zoq for the great advice.
It made sense the moment I read it. :)

I think this is ready for review then.
This isn't urgent or anything; whenever you (or anyone else) can find the time will be fine.

@conradsnicta requested a review from zoq October 17, 2025 15:58
@rcurtin
Member

rcurtin commented Oct 17, 2025

It is definitely not okay to merge an optimizer that has not been thoroughly reviewed. I will get to it when I have a chance, and there is no schedule for when that will be.

Member

@rcurtin left a comment


Thanks for the contribution @ranjodhsingh1729. I have some concerns about the correctness of the implementation---can you look into the comments I left? Let me know if I can clarify anything, or if I missed anything. I used primarily section 6 of the paper for my review.

for all Lagrange multipliers and 10 is used as the initial penalty parameter.


Member


Thanks, nice catch!

#### See also:
* [Increased rates of convergence through learning rate adaptation](https://www.academia.edu/download/32005051/Jacobs.NN88.pdf)
Member


Suggested change
* [Increased rates of convergence through learning rate adaptation](https://www.academia.edu/download/32005051/Jacobs.NN88.pdf)
* [Increased rates of convergence through learning rate adaptation (pdf)](https://www.academia.edu/download/32005051/Jacobs.NN88.pdf)

Most of the other links indicate to people when a PDF is going to be opened/downloaded.

Also it's probably worth linking to GradientDescent here too.


Note that `DeltaBarDelta` is based on the templated type
`GradientDescentType<`_`UpdatePolicyType, DecayPolicyType`_`>` with _`UpdatePolicyType`_` =
`DeltaBarDeltaUpdate` and _`DecayPolicyType`_` = NoDecay`.
Member


Because DeltaBarDelta always uses the DeltaBarDeltaUpdate class, it doesn't really make sense to the user to provide a constructor that takes updatePolicy, decayPolicy, and resetPolicy. Of course an advanced user can do this if they take a look at the internals of the GradientDescent class. But for the typical user, they just want to set the settings of DeltaBarDelta and move on. So I would suggest adding a wrapper class DeltaBarDelta that just calls GradientDescentType<DeltaBarDeltaUpdate, NoDecay> internally (there are many of these for SGD variants), and also provides a constructor that forwards the parameters of DeltaBarDeltaUpdate:

DeltaBarDelta(stepSize, maxIterations, tolerance, kappa, phi, momentum, minGain, resetPolicy)

Then the table of options below will be much simplified too.
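
Something along these lines would do it (a rough sketch; the constructor defaults and the parameter order of DeltaBarDeltaUpdate are guesses, not the final API):

class DeltaBarDelta
{
 public:
  DeltaBarDelta(const double stepSize = 0.01,
                const size_t maxIterations = 100000,
                const double tolerance = 1e-5,
                const double kappa = 0.2,
                const double phi = 0.8,
                const double momentum = 0.5,
                const double minGain = 0.01,
                const bool resetPolicy = true) :
      optimizer(stepSize, maxIterations, tolerance,
                DeltaBarDeltaUpdate(kappa, phi, momentum, minGain),
                NoDecay(), resetPolicy)
  { }

  template<typename FunctionType, typename MatType, typename... CallbackTypes>
  typename MatType::elem_type Optimize(FunctionType& function,
                                       MatType& iterate,
                                       CallbackTypes&&... callbacks)
  {
    return optimizer.Optimize(function, iterate,
        std::forward<CallbackTypes>(callbacks)...);
  }

  // Accessors (StepSize(), Kappa(), Phi(), ...) would forward to `optimizer`
  // and its update policy in the same way.

 private:
  GradientDescentType<DeltaBarDeltaUpdate, NoDecay> optimizer;
};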

//! Get the step size decay policy.
const DecayPolicyType& DecayPolicy() const { return decayPolicy; }
//! Modify the step size decay policy.
DecayPolicyType& DecayPolicy() { return decayPolicy; }
Member


Shouldn't there be one for ResetPolicy() too?
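
Something like this, mirroring the pair above (resetPolicy being the bool member set in the constructor):

//! Get whether the update policy parameters are reset before each Optimize() call.
bool ResetPolicy() const { return resetPolicy; }
//! Modify whether the update policy parameters are reset before each Optimize() call.
bool& ResetPolicy() { return resetPolicy; }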

Comment on lines 30 to 33
const bool resetPolicy)
: stepSize(stepSize), maxIterations(maxIterations), tolerance(tolerance),
updatePolicy(updatePolicy), decayPolicy(decayPolicy),
resetPolicy(resetPolicy), isInitialized(false)
Member


Suggested change
const bool resetPolicy)
: stepSize(stepSize), maxIterations(maxIterations), tolerance(tolerance),
updatePolicy(updatePolicy), decayPolicy(decayPolicy),
resetPolicy(resetPolicy), isInitialized(false)
const bool resetPolicy) :
stepSize(stepSize),
maxIterations(maxIterations),
tolerance(tolerance),
updatePolicy(updatePolicy),
decayPolicy(decayPolicy),
resetPolicy(resetPolicy),
isInitialized(false)

Just a style fix.

DeltaBarDelta s(0.001, 0, 1e-15);
FunctionTest<RosenbrockFunction, TestType>(s, 0.01, 0.001);
}

Member


Can you add one more test for the logistic regression test function? That is a really simple test but it is a good sanity check for machine learning applications. It should be really easy; here's the Adam logistic regression function test:

TEMPLATE_TEST_CASE("Adam_LogisticRegressionFunction", "[Adam]",
    ENS_ALL_TEST_TYPES)
{
  Adam adam(0.032);
  LogisticRegressionFunctionTest<TestType>(adam);
}
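
For this PR it would look something like the following (the step size is just a guess and may need tuning to make the test pass):

TEMPLATE_TEST_CASE("DeltaBarDelta_LogisticRegressionFunction", "[DeltaBarDelta]",
    ENS_ALL_TEST_TYPES)
{
  DeltaBarDelta s(0.01, 0, 1e-15);
  LogisticRegressionFunctionTest<TestType>(s);
}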

@@ -0,0 +1,51 @@
/**
* @file delta_bar_detla_test.cpp
Member


Suggested change
* @file delta_bar_detla_test.cpp
* @file delta_bar_delta_test.cpp

const MatType mask = conv_to<MatType>::from((gradient % velocity) < 0.0);

gains += mask * parent.Kappa();
gains %= mask + (1 - mask) * parent.Phi();
Member


Based on my read of Eq. (4) in the paper I'm not sure this is correct. The value of gains[i] should not change if (gradient % velocity)[i] is 0; but given how mask is set, I can't quickly see that this is correct.

I would suggest maybe using sign() instead to create a matrix indicating whether the element is positive, negative, or zero, and then use that matrix to apply the update rule in (4).
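
Roughly like this, perhaps (just a sketch, untested; `deltaBar` here stands for the exponential average of past gradients, and note that I'm using phi in the paper's sense, i.e. the rate is scaled by (1 - phi) on a sign disagreement, which differs from how phi is used in the current code):

// s is +1 / -1 / 0 depending on the sign of each element of deltaBar % gradient.
const MatType s = sign(deltaBar % gradient);
const MatType pos = conv_to<MatType>::from(s > 0.0);
const MatType neg = conv_to<MatType>::from(s < 0.0);

// Eq. (4): additive increase where the product is positive, scale by
// (1 - phi) where it is negative, and leave the rate untouched where it is zero.
gains += parent.Kappa() * pos;
gains %= (1.0 - parent.Phi() * neg);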


velocity *= parent.momentum;
velocity -= (stepSize * gains) % gradient;
iterate += velocity;
Member


But the paper seems to indicate we update with ϵ(t + 1) in Eq. (3), not ẟ'(t)?

I would suggest using the same terminology as the paper throughout even if it is not intuitive what ẟ', ϵ, and θ actually mean---you can clarify that part in the comments.

gains.clamp(parent.MinimumGain(), arma::datum::inf);

velocity *= parent.momentum;
velocity -= (stepSize * gains) % gradient;
Member


How does this implement the exponential average ẟ'(t) = (1 - θ) ẟ(t) + θ ẟ'(t - 1)? This doesn't appear to be that.
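
For reference, this is my reading of the rule from the paper, with δ(t) the current gradient component (equation numbers from memory, so they may be slightly off):

\begin{align*}
\bar{\delta}(t) &= (1 - \theta)\,\delta(t) + \theta\,\bar{\delta}(t - 1) \\
\Delta\epsilon(t) &=
  \begin{cases}
    \kappa             & \text{if } \bar{\delta}(t - 1)\,\delta(t) > 0 \\
    -\phi\,\epsilon(t) & \text{if } \bar{\delta}(t - 1)\,\delta(t) < 0 \\
    0                  & \text{otherwise}
  \end{cases} \\
w(t + 1) &= w(t) - \epsilon(t + 1)\,\delta(t)
\end{align*}

There is no momentum term in the weight update itself.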

@ranjodhsingh1729
Author

@rcurtin, thanks a lot for the thorough review - turns out I've made quite a mistake here.

Since both t-SNE papers (the original and the tree-based approximation) state that they use Jacobs's scheme and cite this paper, I assumed it was used as-is and didn't verify that assumption.

Now I know that what is used in t-SNE is a modified version of the DBD rule that incorporates momentum, while the original DBD only includes momentum for comparison and doesn't apply it in the updates.
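
Roughly, as I understand van der Maaten's implementation (my reading, not from the Jacobs paper): per-parameter gains g_i replace the explicit learning rates, δ_i is the gradient, v_i the velocity, μ the momentum, η the global step size, and κ, φ, g_min the gain parameters:

\begin{align*}
g_i(t) &=
  \begin{cases}
    g_i(t - 1) + \kappa & \text{if } \operatorname{sign}(\delta_i(t)) \ne \operatorname{sign}(v_i(t - 1)) \\
    \phi\,g_i(t - 1)    & \text{otherwise}
  \end{cases}
  \qquad \text{(then clamped so that } g_i(t) \ge g_{\min}\text{)} \\
v_i(t) &= \mu\,v_i(t - 1) - \eta\,g_i(t)\,\delta_i(t) \\
w_i(t + 1) &= w_i(t) + v_i(t)
\end{align*}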

Glad this was caught during review - definitely a reminder not to rush things like this in the future. 😅

Since there's already an open issue, PR, and review, I've gone ahead and implemented the actual delta-bar-delta rule and also addressed the review suggestions.

So I guess this puts me back at square one - I'll need to implement the t-SNE update rule in mlpack and hook it up with the refactored gradient descent class. Since it's not really used elsewhere, adding it to ensmallen probably doesn't make sense. Does that sound right, or is there a cleaner way to handle it?

And sorry for the confusion here.
