
Conversation

@bnb32 bnb32 commented Feb 5, 2025

A running mean of the disc loss is compared to disc loss thresholds for each batch to decide whether the discriminator should be trained on that batch. This running mean was not including loss values from batches in the previous epoch, which introduced a jump in the running mean at the start of each epoch. I initially resolved this by just getting the last value in the history, but decided that the running mean code was convoluted and needed a rework. This is now handled with a dataframe queue (self._train_record / self._val_record) of loss details for the past N batches, where N is the number of batches per epoch. The running mean is then computed simply by calling self._train_record.mean().
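
For illustration, a minimal sketch of the rolling-record idea described above, assuming pandas. The _train_record attribute and the .mean() call come from the description; the column names, the n_batches argument, and the _record_batch_loss helper are hypothetical stand-ins, not the actual sup3r code.

import pandas as pd

class RollingLossRecord:
    """Sketch: keep loss details for the last N batches and expose a running mean."""

    def __init__(self, n_batches):
        self.n_batches = n_batches  # N = number of batches per epoch
        self._train_record = pd.DataFrame()

    def _record_batch_loss(self, loss_details):
        # Append one batch's loss details and drop anything older than N batches,
        # so the mean spans epoch boundaries without a jump.
        row = pd.DataFrame([loss_details])
        self._train_record = pd.concat(
            [self._train_record, row], ignore_index=True
        ).tail(self.n_batches)

    def running_mean(self):
        # Running mean over the most recent N batches.
        return self._train_record.mean()

record = RollingLossRecord(n_batches=4)
for i in range(10):
    record._record_batch_loss({'disc_loss': 0.1 * i, 'gen_loss': 1.0 - 0.05 * i})
print(record.running_mean())  # averages only the 4 most recent batches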

@bnb32 bnb32 requested a review from grantbuster February 6, 2025 16:01
@grantbuster grantbuster left a comment

Generally I think this refactor makes sense, but I think the function stack when calling train() is trending toward massive complexity. I want to challenge you to simplify the stack trace.


return hi_res

def _get_hr_exo_and_loss(

grantbuster (Member):

I could go either way on this, but my knee-jerk reaction is that breaking these lines out into a separate function in a different file just makes the stack trace deeper for little benefit. This function is only called in one place, in a different file, in a relatively short parent function. Seems like we could leave it as-is for less function nesting? My gut feeling is that three direct function calls without any logic are portable enough to not need packaging into a separate function.

bnb32 (Collaborator, Author):

A lot of these extractions are motivated by the work on models with observations. I could delay this until that PR if you prefer.

grantbuster (Member):

I guess I would have to see that work too, but I'd be shocked if a 14-line function really helps reduce the burden of 3 function calls? I really think we should just call the 3 functions directly. More nested functions reduce docstring quality and make it way harder to trace args/kwargs.

bnb32 (Collaborator, Author):

This removes ~50 lines of duplication in the obs branch but we can decide if it's worth doing in that PR.

)
return self._val_record.mean(axis=0)

def _get_batch_loss_details(

grantbuster (Member):

There's a lot more going on here than just getting the loss details! This is running a full gradient descent step, including updating model parameters. The function name and docstring are misleading.

I'm on the fence on this one; I'm not convinced we need to split this out into its own function, for similar reasons (it's only called once, and would we ever call this outside of a training loop?). There are quite a few lines, but it's not that complicated, and neither is the parent function.
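
For context on the naming concern, a rough sketch of why "getting loss details" here really amounts to one full gradient-descent step. This is illustrative only, assuming a TensorFlow-style model; the generator, optimizer, and calc_loss names are hypothetical stand-ins, not the actual sup3r code.

import tensorflow as tf

def _train_batch(generator, optimizer, calc_loss, low_res, high_res):
    # One full gradient-descent step: forward pass, loss, gradients,
    # and a parameter update -- not just loss bookkeeping.
    with tf.GradientTape() as tape:
        hi_res_gen = generator(low_res, training=True)
        loss, loss_details = calc_loss(high_res, hi_res_gen)
    grads = tape.gradient(loss, generator.trainable_variables)
    optimizer.apply_gradients(zip(grads, generator.trainable_variables))
    return loss_details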

bnb32 (Collaborator, Author):

Yeah, good point on naming. Same comment on extraction - in the work on models with observations it's helpful to have this pulled out but I can delay this until that PR.

grantbuster (Member):

So now we have _run_gradient_descent and run_gradient_descent, and they both take different args and output different things? I don't love that haha. If you simply must have this be a separate function, what about renaming _run_gradient_descent -> _train_batch, and then maybe also consider whether train_epoch should be hidden as _train_epoch to match.

bnb32 (Collaborator, Author):

Yeah I like that better

gen_too_good = disc_too_bad

if not self.generator_weights:
    self.init_weights(batch.low_res.shape, batch.high_res.shape)

grantbuster (Member):

Minor simplification: let's move this function call to the start of train() and put the if not self.generator_weights check inside init_weights().
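
A quick sketch of this suggested simplification. This is a hypothetical skeleton, not the actual sup3r classes: the guard moves inside init_weights, and train() initializes weights once up front instead of checking per batch.

class Sup3rGanSketch:
    """Hypothetical skeleton showing the suggested weight-init flow."""

    def __init__(self):
        self.generator_weights = []  # empty until init_weights runs

    def init_weights(self, lr_shape, hr_shape):
        # The guard lives here now, so callers never need to repeat the check.
        if self.generator_weights:
            return
        self.generator_weights = [lr_shape, hr_shape]  # placeholder for the real weight build

    def train(self, batch_handler):
        # Initialize weights once at the start instead of inside the batch loop.
        first_batch = next(iter(batch_handler))
        self.init_weights(first_batch.low_res.shape, first_batch.high_res.shape)
        # ... epoch / batch loops follow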

@bnb32 bnb32 merged commit dfee45d into main Feb 21, 2025
12 checks passed
@bnb32 bnb32 deleted the bnb/disc_training_fix branch February 21, 2025 23:02
github-actions bot pushed a commit that referenced this pull request Feb 21, 2025