Support gradient accumulation #1238

Open
wants to merge 25 commits into base: main
Conversation

janEbert (Author)

First, the batched backward calculation is refactored into its own function. Then, gradient accumulation is implemented by moving the data iterator inside the train_step method and consuming data from it as necessary. I added some extra handling for non-infinite data iterators, but if you dislike that additional complexity, I can remove it to simplify the code.

The feature is enabled by giving an additional --training.global_batch_size, which has a sensible default of 1 gradient accumulation step (i.e., no actual accumulation).
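For orientation, a minimal sketch of what the accumulation loop inside train_step looks like (simplified and hedged: helper names such as next_batch and forward_backward_step follow the shape this PR converges on later in the review, and details like gradient clipping, pipeline parallelism, and metrics are omitted):

```python
# Hedged sketch, not the literal PR code.
self.optimizers.zero_grad()
for _ in range(self.gradient_accumulation_steps):
    input_dict, labels = self.next_batch(data_iterator)
    # The loss function is rescaled by 1/gradient_accumulation_steps, so the
    # accumulated gradients approximate those of one large global batch.
    loss = self.forward_backward_step(input_dict, labels)
self.optimizers.step()
self.lr_schedulers.step()
```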

@tianyu-l thanks for the ping.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on May 29, 2025
@janEbert (Author)

Maybe it would also make sense to rename --training.batch_size to --training.local_batch_size accordingly to differentiate it further from the --training.global_batch_size config.

@fegin (Contributor) left a comment

Thanks for the PR. I suggest that we don't let train_step() be aware of data_iterator. Please see the detailed comments.

Also, this PR doesn't change the parallelization, which is not correct. We will have to call set_requires_gradient_sync if FSDP is applied. We can raise an exception if DDP is used and accumulation_steps > 1 for now.
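For context, a hedged sketch of what that suggestion could look like with FSDP2 (fully_shard), whose wrapped modules expose set_requires_gradient_sync; the loop shape and helper names are illustrative assumptions, and, as discussed further down, the PR deliberately does not adopt this:

```python
# Skip the gradient reduce-scatter on all but the last micro-batch of an
# accumulation window. This saves communication but keeps unsharded gradients
# in memory until the final micro-batch.
for micro_step in range(gradient_accumulation_steps):
    is_last_micro_step = micro_step == gradient_accumulation_steps - 1
    for model_part in model_parts:
        if hasattr(model_part, "set_requires_gradient_sync"):  # FSDP2 module
            model_part.set_requires_gradient_sync(is_last_micro_step)
    input_dict, labels = next_batch(data_iterator)
    loss = forward_backward_step(input_dict, labels)
```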

unwrapped_loss_fn = self.loss_fn

@functools.wraps(unwrapped_loss_fn)
def accumulated_loss_fn(*args, **kwargs):
Contributor

We should just modify build_loss_fn to take accumulation_steps to let the loss function decide the usage.

Contributor

I'm OK either way.
I think being more explicit about grad accumulation handling doesn't look bad.
Also if we go with explicit global_batch_size and implicit grad_accu_steps, then we'll need to do another check & computation in the loss function.

Author

I moved the wrapping functionality to torchtitan.components.loss and called it rescale_accumulated_loss. Not quite what you wanted, but that way we can re-use the Trainer.gradient_accumulation_step value more easily.
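A minimal sketch of what such a wrapper can look like (hedged: the real helper lives in torchtitan.components.loss and its exact signature may differ):

```python
import functools
from typing import Callable

import torch


def rescale_accumulated_loss(
    unwrapped_loss_fn: Callable[..., torch.Tensor], accumulation_steps: int
) -> Callable[..., torch.Tensor]:
    """Rescale a mean-reduced loss so that accumulating over
    `accumulation_steps` micro-batches reproduces the mean over the full
    global batch."""

    @functools.wraps(unwrapped_loss_fn)
    def accumulated_loss_fn(*args, **kwargs) -> torch.Tensor:
        return unwrapped_loss_fn(*args, **kwargs) / accumulation_steps

    return accumulated_loss_fn
```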

@tianyu-l (Contributor) left a comment

Thank you for adding this feature!
I left several comments. Please see if they make sense.

@@ -192,6 +192,11 @@ class Training:
batch_size: int = 8
Contributor

yeah let's call it local_batch_size

Author

Did a rename across the codebase wherever JobConfig.training.batch_size or --training.batch_size was used. Not sure how you'd like me to handle the compatibility breakage that this introduces.
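After the rename, the relevant part of the Training config looks roughly like this (a sketch under the assumption that -1 is the "unset" sentinel, which matches the `< 0` check in the derivation code below; the docstrings are illustrative):

```python
from dataclasses import dataclass


@dataclass
class Training:
    local_batch_size: int = 8
    """Batch size per data-parallel rank (formerly --training.batch_size)."""

    global_batch_size: int = -1
    """Global batch size; a negative value means local_batch_size * dp_degree,
    i.e. a single gradient accumulation step."""
```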

Comment on lines 123 to 141
if job_config.training.global_batch_size < 0:
    job_config.training.global_batch_size = (
        job_config.training.batch_size * dp_degree
    )
assert job_config.training.global_batch_size > 0
assert (
    job_config.training.global_batch_size
    % (job_config.training.batch_size * dp_degree)
    == 0
), (
    f"global batch size must be multiple of local batch size times "
    f"data-parallel degree ({job_config.training.global_batch_size} "
    f"% ({job_config.training.batch_size} * {dp_degree}) != 0)"
)

self.gradient_accumulation_steps = job_config.training.global_batch_size // (
    job_config.training.batch_size * dp_degree
)
assert self.gradient_accumulation_steps > 0
Contributor

nit comment

Suggested change

Before:

if job_config.training.global_batch_size < 0:
    job_config.training.global_batch_size = (
        job_config.training.batch_size * dp_degree
    )
assert job_config.training.global_batch_size > 0
assert (
    job_config.training.global_batch_size
    % (job_config.training.batch_size * dp_degree)
    == 0
), (
    f"global batch size must be multiple of local batch size times "
    f"data-parallel degree ({job_config.training.global_batch_size} "
    f"% ({job_config.training.batch_size} * {dp_degree}) != 0)"
)
self.gradient_accumulation_steps = job_config.training.global_batch_size // (
    job_config.training.batch_size * dp_degree
)
assert self.gradient_accumulation_steps > 0

After:

global_batch_size = job_config.training.global_batch_size
if global_batch_size < 0:
    global_batch_size = job_config.training.batch_size * dp_degree
    self.gradient_accumulation_steps = 1
else:
    assert global_batch_size > (job_config.training.batch_size * dp_degree)
    assert (
        job_config.training.global_batch_size
        % (job_config.training.batch_size * dp_degree)
        == 0
    ), (
        f"global batch size must be multiple of local batch size times "
        f"data-parallel degree ({global_batch_size} "
        f"% ({job_config.training.batch_size} * {dp_degree}) != 0)"
    )
    self.gradient_accumulation_steps = global_batch_size // (
        job_config.training.batch_size * dp_degree
    )

Author

I don't really agree with not re-using the code that would become the else case here, but I can still change it to your recommendation. For now, I put the addition of the global_batch_size variable into its own commit, which probably already has the readability improvements you'd like. I also added a comment in the if case noting that this global batch size results in 1 gradient accumulation step.

@@ -183,6 +205,15 @@ def __init__(self, job_config: JobConfig):

self.loss_fn = self.train_spec.build_loss_fn(job_config)

unwrapped_loss_fn = self.loss_fn
Contributor

Let's put the self.gradient_accumulation_steps derivation code right before here, to group gradient accum logic together as much as possible.

I understand that it is desirable to fail early on infeasible global batch size, even before parallelism and other heavy things are applied. But I'd suggest we prioritize readability. What do you think?

Author

Sounds fair! :)

Author

Moved this.


# Keep these variables local to shorten the code as these are
# the major variables that are used in the training loop.
def batch_backward(self, input_dict: dict[str, torch.Tensor], labels: torch.Tensor):
Contributor

can we call it forward_backward_step?

Author

Done. By the way, if you'd prefer me to squash these changes into the previous commits, I'd be happy to clean up the commit chain.

Comment on lines 416 to 417
model_parts = self.model_parts
world_mesh = self.world_mesh
Contributor

similarly, maybe not worth keeping these two

Author

Done.

@@ -336,6 +337,7 @@ def __init__(
)
self.ntokens_since_last_log = 0
self.data_loading_times = []
self.accumulated_losses = []
Contributor

Since it represents a core training concept, rather than directly used for metrics logging, let's put this in Trainer, instead of MetricsProcessor.

Author

Done. Also added the gradient_accumulation_steps attribute to the Trainer's dataclass attributes.

except StopIteration:
    # If data runs out during gradient accumulation, that
    # entire step will not be executed.
    return True
Contributor

Instead of explicit return True, can we just call next and let the StopIteration exception propagate to train_step and catch over there?

Author

I initially had it implemented this way, but thought the try block would encapsulate too much code. If anything else raised a StopIteration, it would make debugging much more difficult. Hence the minimized try scope.

Contributor

I would prefer to directly raise StopIteration and let the outer loop catch it. As mentioned in the discussion above, the original design keeps train_step() simple, without a data dependency, so there is no other StopIteration afaik. If other places actually raise StopIteration, we should figure that out.

If we really want to avoid ambiguity, we can have a customized next(), like next_batch(), which would raise a customized DataDepleteException().

Contributor

That's considerate. I think it's quite unlikely other places would also raise StopIteration? Maybe microbatching in pipeline parallel? But over there the number of microbatches should be fixed ahead of time.

Anyway, if you think we need to deal with this explicitly, we should catch the StopIteration exception and raise a customized DataloaderStopIteration exception to be caught by the caller, instead of returning True.

Author

Went with a combination of these suggestions; a Trainer.next_batch method basically just calls next(data_iterator), but catches and re-raises its StopIteration as a new DataloaderStopIteration.
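Roughly, the helper looks like this (a simplified sketch; details in the actual PR, such as device placement and metrics handling, are omitted):

```python
def next_batch(
    self, data_iterator: Iterable[tuple[dict[str, torch.Tensor], torch.Tensor]]
) -> tuple[dict[str, torch.Tensor], torch.Tensor]:
    try:
        return next(data_iterator)
    except StopIteration as e:
        # Re-raise under a dedicated type so the outer training loop can tell
        # dataloader exhaustion apart from any other StopIteration.
        raise DataloaderStopIteration() from e
```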

self.step += 1
self.gc_handler.run(self.step)
self.train_step(inputs, labels)
data_ran_out = self.train_step(data_iterator)
Contributor

we can catch the StopIteration here and do different treatment on self.checkpointer.save in try vs. catch.

Author

See above.

Author

This has been changed, but we now simply break on DataloaderStopIteration to avoid changing the checkpointing logic.

Author

This does change the general logic (e.g., torch_profiler and memory_profiler won't be stepped anymore) compared to the previous code, but it is a bit nicer to read than adding an extra variable check in the while condition, IMO.

@tianyu-l tianyu-l linked an issue May 30, 2025 that may be closed by this pull request
@fegin (Contributor) commented May 30, 2025

@tianyu-l Let me know what you think about the proposal above. Don't want @janEbert to be stuck in two different reviews.

@tianyu-l (Contributor) left a comment

> Also, this PR doesn't change the parallelization, which is not correct. We will have to call set_requires_gradient_sync if FSDP is applied.

@fegin For background please see #292 (comment)

I think for us we don't want the potential memory overhead and code complexity, although it can save some communications which could've been hidden anyway.


def train_step(
    self,
    data_iterator: Iterable[tuple[dict[str, torch.Tensor], torch.Tensor]],
) -> bool | None:
Contributor

We should just return bool and change all other returns to return False to keep the semantics consistent. This would need to change if we keep the return value as the design option. But I prefer try/catch. See the response below.

Author

Reverted this and refactored to the try/except solution as per the other discussions. The return type is back to an implicit None.


@fegin (Contributor) commented May 30, 2025

The review order looks pretty confusing, lol. A summary of the big discussions:

  1. Keep the design with a new forward_backward_step and global_batch to align with RL use case.
  2. Avoid returning a value from train_step(), using a customized Exception for data depletion.

cc @tianyu-l

@tianyu-l (Contributor) commented Jun 3, 2025

Hey @janEbert, how about we work a bit more on the PR?

Sorry for the confusion in the reviews. I think we have agreed on the direction:

  1. Keep the design with a new forward_backward_step and global_batch to align with the RL use case.
  2. Avoid returning a value from train_step(), using a customized Exception for data depletion.

Please also add a test case in https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py

janEbert added a commit to janEbert/torchtitan that referenced this pull request Jun 3, 2025
@fegin said:
> TorchTitan currently doesn't perform force checkpoint if data is
> depleted. We can fix this but I suggest that we don't do this in this
> PR.

(See
pytorch#1238 (comment).)
@janEbert (Author) commented Jun 3, 2025

I believe I have incorporated all the feedback. Let me know how you like the changes. FYI I'm currently at a conference and will be on vacation from Friday, so it would be great to get this done before then, even if I may only sporadically find time. :)

janEbert added 16 commits June 3, 2025 16:08
Previously `int | None`. Makes it possible to obtain the automatic
calculation of it when it has already been set in a TOML config.
@fegin said:
> TorchTitan currently doesn't perform force checkpoint if data is
> depleted. We can fix this but I suggest that we don't do this in this
> PR.

(See
pytorch#1238 (comment).)
I.e., a new `DataloaderStopIteration` that inherits from
`StopIteration`.

Accordingly, no longer return an optional `bool` to indicate depletion
and adapt the remainder of the code to catch the new exception instead.
This concerns only renaming
- `--training.batch_size` to `--training.local_batch_size` and
- `job_config.training.batch_size` to
  `job_config.training.local_batch_size`.
I.e., the method in `Trainer`.
Instead use a new helper variable `global_batch_size` for all logic.
Improves readability.
These were only used in 1 or 2 locations each.
... from `MetricsProcessor`.
@janEbert (Author) commented Jun 3, 2025

Rebased because of local_batch_size changes.

@tianyu-l (Contributor) left a comment

Looks almost good! Please address final comments.

Also the addition of forward_backward_step breaks the FLUX model training.
Could you help refactor the train_step to forward_backward_step over there? Probably just

  1. remove the optimizer.zero_grad
  2. remove https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/flux/train.py#L152-L180
  3. return loss

For the eval step (https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/flux/train.py#L182): it should be done in Trainer.train(), but since we are not using grad accumulation in FLUX training, it is OK to leave it in forward_backward_step to accelerate landing of this PR, as long as CI tests pass. @wwwjn and I will work together on fixing it later.

OverrideDefinitions(
    [
        [
            # Default local batch size = 8, and `ngpu=2`, so
Contributor

Let's explicitly specify the local batch size as well, in case some future PR changes the default without changing the test here.

Author

Done.
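For reference, the test entry might look roughly like this (hedged: the positional-argument order of OverrideDefinitions is assumed from the existing entries in integration_tests.py, and the concrete values are just an example); with ngpu=2 and a local batch size of 8, a global batch size of 32 gives 2 gradient accumulation steps:

```python
OverrideDefinitions(
    [
        [
            "--training.local_batch_size 8",
            "--training.global_batch_size 32",
        ],
    ],
    "Gradient accumulation",
    "gradient_accumulation",
    ngpu=2,
),
```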

@@ -333,7 +338,7 @@ class Parallelism:
pipeline_parallel_microbatch_size: int = 1
"""
The size of each pipeline parallel microbatch (default 1).
This value is used to compute the total number of microbatches by dividing batch_size with
This value is used to compute the total number of microbatches by dividing local batch_size with
Contributor

Suggested change

Before: This value is used to compute the total number of microbatches by dividing local batch_size with
After: This value is used to compute the total number of microbatches by dividing local_batch_size with

Author

Great catch! I didn't see the underscore on my dirty screen lol

Comment on lines 35 to 38
class DataloaderStopIteration(StopIteration):
    """An exception that indicates dataloader exhaustion."""

    pass
Contributor
Author

Done.

try:
    self.train_step(data_iterator)
except DataloaderStopIteration:
    logger.info("Ran out of data; last step was canceled.")
Contributor

Suggested change

Before: logger.info("Ran out of data; last step was canceled.")
After: logger.warning("Ran out of data; last step was canceled.")

Author

Done.


# Keep these variables local to shorten the code as these are
# the major variables that are used in the training loop.
def next_batch(
Contributor

This function sounds less necessary, especially when we already have dataloader and batch_generator. Given how short it is, it seems not too bad just running the try-catch in train_step?

Author

To me, it makes the train_step look cleaner and it was nice to have it re-usable for the FLUX refactor. Does that change your mind? :)

Author

I was thinking of patching the data iterator's __next__ method on the fly to ensure the DataloaderStopIteration is raised, but didn't want to introduce too much black magic. It would require modifying the ParallelAwareDataloader.__iter__ method to apply the patch to the returned iterator. What do you think of that option?

Contributor

I would suggest keeping the current implementation. Monkey patching is usually not a good idea. I also agree this function makes train_step cleaner.

As a future benefit, we may want to do data loader pipelining, which overlaps the to("cuda") with the computation. This function gives us a good place to implement it.

job_config.training.local_batch_size * dp_degree
)
assert self.gradient_accumulation_steps > 0
self.loss_fn = rescale_accumulated_loss(
Contributor

This is a comment, not a suggestion:

The code sounds to me like it assumes the loss function we use performs a "mean" reduction, instead of the "sum" that is also available in e.g. cross entropy loss. But I believe this assumption is also made in PyTorch DDP, FSDP, PP, and is universally accepted as the default now. So I think it's OK.
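To make that assumption concrete: with reduction="mean" and equal micro-batch sizes (which holds for fixed-length token batches), rescaling each micro-batch loss by 1/accumulation_steps and summing reproduces the loss of the whole global batch. A small self-contained check:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))

# Loss over the full "global" batch, mean-reduced over all 8 samples.
full_batch_loss = F.cross_entropy(logits, labels)

# Same data split into 4 micro-batches of 2 samples each.
accumulation_steps = 4
accumulated_loss = sum(
    F.cross_entropy(logits[i * 2 : (i + 1) * 2], labels[i * 2 : (i + 1) * 2])
    / accumulation_steps
    for i in range(accumulation_steps)
)

assert torch.allclose(full_batch_loss, accumulated_loss)
```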

Author

Good point. I added a docstring to the function to explicitly mention this.

Contributor

Yes, CP also assumes mean. A docstring will be nice, thanks!

@janEbert (Author) commented Jun 4, 2025

PTAL.

@wwwjn (Contributor) commented Jun 4, 2025

> Looks almost good! Please address final comments.
>
> Also the addition of forward_backward_step breaks the FLUX model training. Could you help refactor the train_step to forward_backward_step over there? Probably just
>
>   1. remove the optimizer.zero_grad
>   2. remove https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/flux/train.py#L152-L180
>   3. return loss
>
> For the eval step (https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/flux/train.py#L182): it should be done in Trainer.train(), but since we are not using grad accumulation in FLUX training, it is OK to leave it in forward_backward_step to accelerate landing of this PR, as long as CI tests pass. @wwwjn and I will work together on fixing it later.

Agree! The current change on the FLUX side looks good to me. In the future I will also test grad accumulation with FLUX, and ideally I will move eval_step() into the main trainer's train loop and reuse the main trainer's train_step() in FLUX.

@fegin (Contributor) left a comment

LGTM. There are some typing nits, but overall the implementation is clean.


def forward_backward_step(
    self, input_dict: dict[str, torch.Tensor], labels: torch.Tensor
):
Contributor

Can we type the return value?


def train_step(
    self, data_iterator: Iterable[tuple[dict[str, torch.Tensor], torch.Tensor]]
):
Contributor

ditto, can we type the return value?
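For illustration, the typed versions could look like this (a sketch; forward_backward_step returns the micro-batch loss tensor, and train_step returns nothing after the revert discussed above):

```python
def forward_backward_step(
    self, input_dict: dict[str, torch.Tensor], labels: torch.Tensor
) -> torch.Tensor:
    ...

def train_step(
    self, data_iterator: Iterable[tuple[dict[str, torch.Tensor], torch.Tensor]]
) -> None:
    ...
```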

Labels: CLA Signed
Successfully merging this pull request may close these issues.

[Feature] Add gradient accumulation
5 participants