Commit f5177b8

Do not save checkpoint when data ran out
@fegin said:

> TorchTitan currently doesn't perform force checkpoint if data is depleted. We can fix this but I suggest that we don't do this in this PR. (See pytorch#1238 (comment).)
1 parent 9402673 commit f5177b8

torchtitan/train.py

Lines changed: 3 additions & 8 deletions
@@ -494,14 +494,9 @@ def train(self):
             self.gc_handler.run(self.step)
             data_ran_out = self.train_step(data_iterator)
             if data_ran_out:
-                logger.info(
-                    "Ran out of data; last step was canceled. "
-                    "Saving final checkpoint and exiting."
-                )
-            self.checkpointer.save(
-                self.step,
-                force=(self.step == job_config.training.steps or data_ran_out),
-            )
+                logger.info("Ran out of data; last step was canceled.")
+                break
+            self.checkpointer.save(self.step, force=(self.step == job_config.training.steps))
 
             # signal the profiler that the next profiling step has started
             if torch_profiler:
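To make the new behavior concrete, below is a minimal, self-contained sketch (not TorchTitan's actual code) of the training-loop control flow after this commit: when the data iterator is exhausted, the loop logs a message and breaks out without forcing a checkpoint, and a checkpoint is only forced when the configured final step is reached. The ToyTrainer class, its constructor arguments, and the print-based logging are illustrative assumptions; only the train_step / checkpointer.save(step, force=...) shape comes from the diff above.

# Hypothetical sketch of the post-commit control flow; everything except the
# train_step / checkpointer.save(step, force=...) shape is assumed for illustration.
class ToyTrainer:
    def __init__(self, checkpointer, total_steps):
        self.checkpointer = checkpointer
        self.total_steps = total_steps
        self.step = 0

    def train_step(self, data_iterator):
        """Run one step; return True if the iterator ran out of data."""
        try:
            batch = next(data_iterator)
        except StopIteration:
            return True
        # ... forward / backward / optimizer update on `batch` would go here ...
        return False

    def train(self, data_iterator):
        while self.step < self.total_steps:
            self.step += 1
            data_ran_out = self.train_step(data_iterator)
            if data_ran_out:
                # After this commit: no forced checkpoint on data exhaustion.
                print("Ran out of data; last step was canceled.")
                break
            # A checkpoint is forced only on the configured final step.
            self.checkpointer.save(self.step, force=(self.step == self.total_steps))

With this structure, a run that exhausts its data before reaching the final step exits without a final checkpoint; per @fegin's note quoted above, adding a forced checkpoint on data depletion is deliberately left to a later change.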
