Skip to content

Commit 671c30c

Browse files
Saiteja64saiteja64
authored andcommitted
Titan changes to use DCP ZOC instead of titan default Async + Pinned Memory (#1287)
Test Titan changes to use DCP ZOC instead of titan default ### Loss Curve DCP + ZOC. Traing with DCP ZOC Until Step 500. Delete Last Checkpoint (since it's sync save). Then Run Load and Run Training Until Step 1000 <img width="1126" alt="Screenshot 2025-06-25 at 3 59 54 PM" src="https://github.com/user-attachments/assets/d5c7986d-9a03-402f-a664-239807285117" /> ### Loss Curve TorchTitan Async + Pinned Memory Without DCP ZOC <img width="936" alt="Screenshot 2025-06-25 at 4 05 50 PM" src="https://github.com/user-attachments/assets/0f616460-076b-4ef9-ba53-e0e285c030e7" /> ### DCP + Titan Training on LLAMA3 8B Model For 500 Steps ``` CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh + NGPU=8 + export LOG_RANK=0 + LOG_RANK=0 + CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml + overrides= + '[' 0 -ne 0 ']' + TORCHFT_LIGHTHOUSE=http://localhost:29510 + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + TORCHFT_LIGHTHOUSE=http://localhost:29510 + torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] ***************************************** W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] ***************************************** [rank0]:[titan] 2025-06-25 12:55:00,591 - root - INFO - Starting job: Llama 3 8B training [rank0]:[titan] 2025-06-25 12:55:02,714 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:[titan] 2025-06-25 12:55:02,716 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8] [rank0]:[titan] 2025-06-25 12:55:02,717 - root - INFO - [GC] Initial GC collection. 0.00 seconds. [rank0]:NCCL version 2.27.3+cuda12.4 [rank0]:[rank0]:[W625 12:55:10.376686804 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 12:55:11,222 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001 [rank0]:[titan] 2025-06-25 12:55:11,223 - root - INFO - Preparing c4 dataset from allenai/c4 [rank0]:[titan] 2025-06-25 12:55:17,789 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001) [rank0]:[titan] 2025-06-25 12:55:17,927 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1255 [rank0]:[titan] 2025-06-25 12:55:17,928 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory [rank0]:[titan] 2025-06-25 12:55:17,943 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-25 12:55:17,944 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-25 12:55:18,010 - root - INFO - Applied FSDP to the model [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[rank0]:[W625 12:55:18.048917586 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 500 (warmup 200). [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Training starts at step 1. [rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-25 12:55:24,823 - root - INFO - step: 1 loss: 12.2478 memory: 42.16GiB(44.38%) tps: 1,191 tflops: 68.97 mfu: 6.97% [rank0]:[titan] 2025-06-25 12:55:24,824 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 12:55:24,842 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.04 seconds. [rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[0] 2025-06-25 12:55:32,097 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[W625 12:55:32.282661499 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Checkpoint background process is running... [rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Waiting for checkpoint save request... [rank0]:[0] 2025-06-25 12:55:37,346 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-1 [rank0]:[titan] 2025-06-25 12:55:55,589 - root - INFO - step: 10 loss: 9.9493 memory: 49.66GiB(52.28%) tps: 2,397 tflops: 138.80 mfu: 14.03% [rank0]:[titan] 2025-06-25 12:56:09,436 - root - INFO - step: 20 loss: 8.5802 memory: 49.66GiB(52.28%) tps: 5,917 tflops: 342.67 mfu: 34.65% [rank0]:[titan] 2025-06-25 12:56:23,343 - root - INFO - step: 30 loss: 7.7218 memory: 49.66GiB(52.28%) tps: 5,892 tflops: 341.21 mfu: 34.50% [rank0]:[titan] 2025-06-25 12:56:37,275 - root - INFO - step: 40 loss: 7.3309 memory: 49.66GiB(52.28%) tps: 5,881 tflops: 340.60 mfu: 34.44% [rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-1', uuid='320c8c66-1115-416f-a22a-50d86e094b3a') [rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 12:56:49,706 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 12:56:51,223 - root - INFO - step: 50 loss: 7.1024 memory: 49.66GiB(52.28%) tps: 5,874 tflops: 340.20 mfu: 34.40% [rank0]:[titan] 2025-06-25 12:57:05,194 - root - INFO - step: 60 loss: 6.8512 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34% [rank0]:[titan] 2025-06-25 12:57:19,180 - root - INFO - step: 70 loss: 6.9118 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.30 mfu: 34.31% [rank0]:[titan] 2025-06-25 12:57:33,158 - root - INFO - step: 80 loss: 6.6411 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-25 12:57:47,135 - root - INFO - step: 90 loss: 6.6820 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33% [rank0]:[titan] 2025-06-25 12:58:00,303 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - step: 100 loss: 6.4539 memory: 49.66GiB(52.28%) tps: 5,555 tflops: 321.72 mfu: 32.53% [rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 12:58:01,884 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 12:58:01,904 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 12:58:02,647 - root - INFO - Dumping profiler traces at step 100 [rank0]:[titan] 2025-06-25 12:58:02,863 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[0] 2025-06-25 12:58:03,725 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-100 [rank0]:[titan] 2025-06-25 12:58:17,301 - root - INFO - step: 110 loss: 6.6926 memory: 49.66GiB(52.28%) tps: 5,314 tflops: 307.75 mfu: 31.12% [rank0]:[titan] 2025-06-25 12:58:31,306 - root - INFO - step: 120 loss: 6.3697 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.81 mfu: 34.26% [rank0]:[titan] 2025-06-25 12:58:45,299 - root - INFO - step: 130 loss: 6.4435 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-25 12:58:59,293 - root - INFO - step: 140 loss: 6.3808 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.05 mfu: 34.28% [rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-100', uuid='324a8712-2df4-4b32-ae9f-eb26516ce3ac') [rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 12:59:11,719 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 12:59:13,294 - root - INFO - step: 150 loss: 6.0557 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27% [rank0]:[titan] 2025-06-25 12:59:27,291 - root - INFO - step: 160 loss: 6.3298 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.01 mfu: 34.28% [rank0]:[titan] 2025-06-25 12:59:41,291 - root - INFO - step: 170 loss: 6.0259 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.92 mfu: 34.27% [rank0]:[titan] 2025-06-25 12:59:55,283 - root - INFO - step: 180 loss: 6.0728 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:00:09,269 - root - INFO - step: 190 loss: 6.7391 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:00:21,734 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - step: 200 loss: 6.1409 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.79 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:00:23,275 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:00:23,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:00:23,801 - root - INFO - Dumping profiler traces at step 200 [rank0]:[titan] 2025-06-25 13:00:24,009 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[0] 2025-06-25 13:00:24,833 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-200 [rank0]:[titan] 2025-06-25 13:00:38,511 - root - INFO - step: 210 loss: 6.0412 memory: 49.66GiB(52.28%) tps: 5,377 tflops: 311.41 mfu: 31.49% [rank0]:[titan] 2025-06-25 13:00:52,498 - root - INFO - step: 220 loss: 6.0333 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.22 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:06,488 - root - INFO - step: 230 loss: 5.9720 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:20,477 - root - INFO - step: 240 loss: 5.8235 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.21 mfu: 34.30% [rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-200', uuid='9a1e278a-8372-4bf5-b7f7-a7fd2b80cc5e') [rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:01:32,911 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds. [rank0]:[titan] 2025-06-25 13:01:34,464 - root - INFO - step: 250 loss: 5.8334 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:01:49,456 - root - INFO - step: 260 loss: 5.7444 memory: 49.66GiB(52.28%) tps: 5,465 tflops: 316.51 mfu: 32.00% [rank0]:[titan] 2025-06-25 13:02:03,466 - root - INFO - step: 270 loss: 5.7170 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:02:17,468 - root - INFO - step: 280 loss: 5.6201 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.87 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:02:31,468 - root - INFO - step: 290 loss: 5.7282 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:02:43,898 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:02:45,474 - root - INFO - step: 300 loss: 5.7546 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:02:45,475 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:02:45,476 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:02:45,496 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:02:46,012 - root - INFO - Dumping profiler traces at step 300 [rank0]:[titan] 2025-06-25 13:02:46,220 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[0] 2025-06-25 13:02:47,106 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-300 [rank0]:[titan] 2025-06-25 13:03:00,674 - root - INFO - step: 310 loss: 5.5701 memory: 49.66GiB(52.28%) tps: 5,390 tflops: 312.17 mfu: 31.56% [rank0]:[titan] 2025-06-25 13:03:14,668 - root - INFO - step: 320 loss: 5.5060 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.09 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:03:28,667 - root - INFO - step: 330 loss: 5.4723 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.96 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:03:42,657 - root - INFO - step: 340 loss: 5.5342 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.18 mfu: 34.29% [rank0]:[0] 2025-06-25 13:03:45,717 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-300', uuid='c7b82f66-2746-49c4-b626-f527beec83cb') [rank0]:[0] 2025-06-25 13:03:45,718 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:03:55,085 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:03:56,649 - root - INFO - step: 350 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:04:10,642 - root - INFO - step: 360 loss: 5.4645 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:04:24,639 - root - INFO - step: 370 loss: 5.3705 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:04:38,629 - root - INFO - step: 380 loss: 5.3711 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.18 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:04:52,611 - root - INFO - step: 390 loss: 5.2803 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:05:05,085 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 13:05:06,613 - root - INFO - step: 400 loss: 5.4034 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:05:06,614 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:05:06,615 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:05:06,642 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.03 seconds. [rank0]:[titan] 2025-06-25 13:05:07,153 - root - INFO - Dumping profiler traces at step 400 [rank0]:[titan] 2025-06-25 13:05:07,351 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:05:08,248 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-400 [rank0]:[titan] 2025-06-25 13:05:21,858 - root - INFO - step: 410 loss: 5.2031 memory: 49.66GiB(52.28%) tps: 5,374 tflops: 311.24 mfu: 31.47% [rank0]:[titan] 2025-06-25 13:05:35,829 - root - INFO - step: 420 loss: 5.3476 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34% [rank0]:[titan] 2025-06-25 13:05:49,793 - root - INFO - step: 430 loss: 5.2579 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36% [rank0]:[titan] 2025-06-25 13:06:03,755 - root - INFO - step: 440 loss: 5.2553 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36% [rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-400', uuid='22d8ae0c-f9c7-4424-ac4b-ab4fdaa4655f') [rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:06:16,214 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds. [rank0]:[titan] 2025-06-25 13:06:17,743 - root - INFO - step: 450 loss: 5.1627 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:06:31,746 - root - INFO - step: 460 loss: 5.1879 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:06:45,740 - root - INFO - step: 470 loss: 5.0814 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.07 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:06:59,746 - root - INFO - step: 480 loss: 5.0594 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.80 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:07:13,748 - root - INFO - step: 490 loss: 5.1015 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.88 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:07:26,234 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:07:27,814 - root - INFO - step: 500 loss: 5.1793 memory: 49.66GiB(52.28%) tps: 5,824 tflops: 337.30 mfu: 34.11% [rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving a full checkpoint at last step, step 500. [rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.69 seconds. [rank0]:[titan] 2025-06-25 13:08:18,133 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Finished dumping profiler traces in 0.24 seconds [rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-25 13:08:20,369 - root - INFO - Training completed [rank0]:[titan] 2025-06-25 13:08:20,370 - root - INFO - Destroying the purge thread. [rank0]:[titan] 2025-06-25 13:08:24,141 - root - INFO - Process group destroyed. ``` ### Delete Checkpoint at 500th Step (Since this was sync saved). Run LLAMA3 8B model training to 1000 steps ``` CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh + NGPU=8 + export LOG_RANK=0 + LOG_RANK=0 + CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml + overrides= + '[' 0 -ne 0 ']' + TORCHFT_LIGHTHOUSE=http://localhost:29510 + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True + TORCHFT_LIGHTHOUSE=http://localhost:29510 + torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] ***************************************** W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] ***************************************** [rank0]:[titan] 2025-06-25 13:10:27,930 - root - INFO - Starting job: Llama 3 8B training [rank0]:[titan] 2025-06-25 13:10:32,443 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config [rank0]:[titan] 2025-06-25 13:10:32,445 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8] [rank0]:[titan] 2025-06-25 13:10:32,447 - root - INFO - [GC] Initial GC collection. 0.00 seconds. [rank0]:NCCL version 2.27.3+cuda12.4 [rank0]:[rank0]:[W625 13:10:37.522041468 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001 [rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - Preparing c4 dataset from allenai/c4 [rank0]:[titan] 2025-06-25 13:10:43,788 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001) [rank0]:[titan] 2025-06-25 13:10:43,930 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1310 [rank0]:[titan] 2025-06-25 13:10:43,931 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory [rank0]:[titan] 2025-06-25 13:10:43,946 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-25 13:10:43,947 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-25 13:10:44,010 - root - INFO - Applied FSDP to the model [rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[rank0]:[W625 13:10:45.257529860 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200). [rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Loading the checkpoint at step 400. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - [GC] GC collection for checkpoint loading. 0.03 seconds. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Finished loading the checkpoint in 11.31 seconds. [rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Training starts at step 401. [rank0]:[titan] 2025-06-25 13:10:56,822 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-25 13:11:13,774 - root - INFO - step: 410 loss: 5.2593 memory: 49.66GiB(52.28%) tps: 2,746 tflops: 159.06 mfu: 16.08% [rank0]:[titan] 2025-06-25 13:11:27,654 - root - INFO - step: 420 loss: 5.4461 memory: 49.66GiB(52.28%) tps: 5,903 tflops: 341.86 mfu: 34.57% [rank0]:[titan] 2025-06-25 13:11:41,595 - root - INFO - step: 430 loss: 5.3764 memory: 49.66GiB(52.28%) tps: 5,877 tflops: 340.34 mfu: 34.41% [rank0]:[titan] 2025-06-25 13:11:55,562 - root - INFO - step: 440 loss: 5.3640 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.74 mfu: 34.35% [rank0]:[titan] 2025-06-25 13:12:07,987 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:12:09,550 - root - INFO - step: 450 loss: 5.2764 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-25 13:12:23,550 - root - INFO - step: 460 loss: 5.2871 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.91 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:12:37,552 - root - INFO - step: 470 loss: 5.1846 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:12:51,565 - root - INFO - step: 480 loss: 5.1532 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.61 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:13:05,570 - root - INFO - step: 490 loss: 5.2000 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.82 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:13:19,252 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:13:20,824 - root - INFO - step: 500 loss: 5.2609 memory: 49.66GiB(52.28%) tps: 5,371 tflops: 311.03 mfu: 31.45% [rank0]:[titan] 2025-06-25 13:13:20,825 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:13:20,826 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:13:20,844 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:13:21,785 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-25 13:13:22,009 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[0] 2025-06-25 13:13:30,315 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process [rank0]:[W625 13:13:30.320876720 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator()) [rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Checkpoint background process is running... [rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Waiting for checkpoint save request... [rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [rank0]:[0] 2025-06-25 13:13:35,906 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-500 [rank0]:[titan] 2025-06-25 13:13:55,786 - root - INFO - step: 510 loss: 5.0734 memory: 49.66GiB(52.28%) tps: 2,343 tflops: 135.71 mfu: 13.72% [rank0]:[titan] 2025-06-25 13:14:09,750 - root - INFO - step: 520 loss: 5.2781 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.80 mfu: 34.36% [rank0]:[titan] 2025-06-25 13:14:23,740 - root - INFO - step: 530 loss: 5.1984 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:14:37,748 - root - INFO - step: 540 loss: 5.0945 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25% [rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-500', uuid='851c2b13-e42f-4630-9a50-edb23b6a52ac') [rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:14:50,219 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-25 13:14:51,767 - root - INFO - step: 550 loss: 5.1208 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.47 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:15:05,767 - root - INFO - step: 560 loss: 5.1587 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.93 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:15:19,775 - root - INFO - step: 570 loss: 4.9740 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.73 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:15:33,792 - root - INFO - step: 580 loss: 5.0386 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.50 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:15:47,811 - root - INFO - step: 590 loss: 5.0741 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.48 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:16:00,259 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - step: 600 loss: 4.9624 memory: 49.66GiB(52.28%) tps: 5,844 tflops: 338.43 mfu: 34.22% [rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:16:01,833 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:16:01,853 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:16:02,368 - root - INFO - Dumping profiler traces at step 600 [rank0]:[titan] 2025-06-25 13:16:02,566 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:16:03,466 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-600 [rank0]:[titan] 2025-06-25 13:16:17,066 - root - INFO - step: 610 loss: 5.0695 memory: 49.66GiB(52.28%) tps: 5,378 tflops: 311.46 mfu: 31.49% [rank0]:[titan] 2025-06-25 13:16:31,050 - root - INFO - step: 620 loss: 5.0808 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:16:45,041 - root - INFO - step: 630 loss: 4.9225 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.14 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:16:59,065 - root - INFO - step: 640 loss: 4.8939 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21% [rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-600', uuid='05657add-86bb-4707-b3d6-3494d586c34d') [rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:17:11,847 - root - INFO - [GC] Peforming periodical GC collection. 0.23 seconds. [rank0]:[titan] 2025-06-25 13:17:13,311 - root - INFO - step: 650 loss: 5.0611 memory: 49.66GiB(52.28%) tps: 5,751 tflops: 333.05 mfu: 33.68% [rank0]:[titan] 2025-06-25 13:17:27,609 - root - INFO - step: 660 loss: 4.9348 memory: 49.66GiB(52.28%) tps: 5,731 tflops: 331.88 mfu: 33.56% [rank0]:[titan] 2025-06-25 13:17:41,811 - root - INFO - step: 670 loss: 4.8798 memory: 49.66GiB(52.28%) tps: 5,769 tflops: 334.13 mfu: 33.78% [rank0]:[titan] 2025-06-25 13:17:56,001 - root - INFO - step: 680 loss: 4.9749 memory: 49.66GiB(52.28%) tps: 5,774 tflops: 334.40 mfu: 33.81% [rank0]:[titan] 2025-06-25 13:18:10,004 - root - INFO - step: 690 loss: 4.7231 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.86 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:18:22,447 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-25 13:18:24,013 - root - INFO - step: 700 loss: 4.8765 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.69 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:18:24,014 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:18:24,015 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:18:24,034 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:18:24,557 - root - INFO - Dumping profiler traces at step 700 [rank0]:[titan] 2025-06-25 13:18:24,762 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:18:25,616 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-700 [rank0]:[titan] 2025-06-25 13:18:39,233 - root - INFO - step: 710 loss: 4.8973 memory: 49.66GiB(52.28%) tps: 5,383 tflops: 311.76 mfu: 31.52% [rank0]:[titan] 2025-06-25 13:18:53,245 - root - INFO - step: 720 loss: 4.8671 memory: 49.66GiB(52.28%) tps: 5,848 tflops: 338.67 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:19:07,252 - root - INFO - step: 730 loss: 5.1143 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.76 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:19:21,270 - root - INFO - step: 740 loss: 4.9568 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.51 mfu: 34.23% [rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-700', uuid='40580106-cb24-4788-a99d-b10ff483e779') [rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:19:33,755 - root - INFO - [GC] Peforming periodical GC collection. 0.11 seconds. [rank0]:[titan] 2025-06-25 13:19:35,288 - root - INFO - step: 750 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:19:49,292 - root - INFO - step: 760 loss: 4.7441 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:20:03,668 - root - INFO - step: 770 loss: 4.7597 memory: 49.66GiB(52.28%) tps: 5,699 tflops: 330.05 mfu: 33.37% [rank0]:[titan] 2025-06-25 13:20:18,428 - root - INFO - step: 780 loss: 4.7751 memory: 49.66GiB(52.28%) tps: 5,551 tflops: 321.47 mfu: 32.50% [rank0]:[titan] 2025-06-25 13:20:32,447 - root - INFO - step: 790 loss: 4.7187 memory: 49.66GiB(52.28%) tps: 5,845 tflops: 338.49 mfu: 34.23% [rank0]:[titan] 2025-06-25 13:20:44,878 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - step: 800 loss: 4.7215 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.65 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:20:46,459 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:20:46,479 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:20:47,058 - root - INFO - Dumping profiler traces at step 800 [rank0]:[titan] 2025-06-25 13:20:47,260 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:20:48,131 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-800 [rank0]:[titan] 2025-06-25 13:21:01,733 - root - INFO - step: 810 loss: 4.7147 memory: 49.66GiB(52.28%) tps: 5,364 tflops: 310.63 mfu: 31.41% [rank0]:[titan] 2025-06-25 13:21:15,731 - root - INFO - step: 820 loss: 4.6326 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.97 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:21:29,735 - root - INFO - step: 830 loss: 4.7056 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:21:43,727 - root - INFO - step: 840 loss: 4.6464 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.13 mfu: 34.29% [rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-800', uuid='a5a2aa11-d40c-4d6e-8f58-a21181f08eaf') [rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:21:56,225 - root - INFO - [GC] Peforming periodical GC collection. 0.13 seconds. [rank0]:[titan] 2025-06-25 13:21:57,742 - root - INFO - step: 850 loss: 4.8118 memory: 49.66GiB(52.28%) tps: 5,846 tflops: 338.58 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:22:11,742 - root - INFO - step: 860 loss: 4.6158 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.94 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:22:25,735 - root - INFO - step: 870 loss: 4.5595 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.12 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:22:39,726 - root - INFO - step: 880 loss: 4.4991 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.17 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:22:53,710 - root - INFO - step: 890 loss: 4.7063 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31% [rank0]:[titan] 2025-06-25 13:23:06,134 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - step: 900 loss: 4.7203 memory: 49.66GiB(52.28%) tps: 5,849 tflops: 338.74 mfu: 34.25% [rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:23:07,718 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds. [rank0]:[titan] 2025-06-25 13:23:07,738 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds. [rank0]:[titan] 2025-06-25 13:23:08,289 - root - INFO - Dumping profiler traces at step 900 [rank0]:[titan] 2025-06-25 13:23:08,491 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[0] 2025-06-25 13:23:09,410 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-900 [rank0]:[titan] 2025-06-25 13:23:22,968 - root - INFO - step: 910 loss: 4.4820 memory: 49.66GiB(52.28%) tps: 5,372 tflops: 311.11 mfu: 31.46% [rank0]:[titan] 2025-06-25 13:23:36,961 - root - INFO - step: 920 loss: 4.6184 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-25 13:23:50,963 - root - INFO - step: 930 loss: 4.6420 memory: 49.66GiB(52.28%) tps: 5,852 tflops: 338.90 mfu: 34.27% [rank0]:[titan] 2025-06-25 13:24:04,987 - root - INFO - step: 940 loss: 4.5612 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.35 mfu: 34.21% [rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-900', uuid='13493b71-2fef-4543-94dc-65dd4c30d299') [rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Waiting for checkpoint save request... [rank0]:[titan] 2025-06-25 13:24:17,438 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds. [rank0]:[titan] 2025-06-25 13:24:18,991 - root - INFO - step: 950 loss: 4.6412 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.84 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:24:33,004 - root - INFO - step: 960 loss: 4.5748 memory: 49.66GiB(52.28%) tps: 5,847 tflops: 338.60 mfu: 34.24% [rank0]:[titan] 2025-06-25 13:24:47,007 - root - INFO - step: 970 loss: 4.6774 memory: 49.66GiB(52.28%) tps: 5,851 tflops: 338.85 mfu: 34.26% [rank0]:[titan] 2025-06-25 13:25:01,005 - root - INFO - step: 980 loss: 4.5435 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 338.99 mfu: 34.28% [rank0]:[titan] 2025-06-25 13:25:15,028 - root - INFO - step: 990 loss: 4.6884 memory: 49.66GiB(52.28%) tps: 5,842 tflops: 338.36 mfu: 34.21% [rank0]:[titan] 2025-06-25 13:25:27,494 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - step: 1000 loss: 4.6401 memory: 49.66GiB(52.28%) tps: 5,828 tflops: 337.50 mfu: 34.13% [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving the checkpoint (or staging if async is enabled). [rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving a full checkpoint at last step, step 1000. [rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds. [rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.21 seconds. [rank0]:[titan] 2025-06-25 13:26:18,296 - root - INFO - Checkpointer is deleting ./outputs/checkpoint/step-1. [rank0]:[titan] 2025-06-25 13:26:19,208 - root - INFO - Dumping profiler traces at step 1000 [rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-25 13:26:21,065 - root - INFO - Checkpointer deleted ./outputs/checkpoint/step-1 in 2.77 seconds. [rank0]:[titan] 2025-06-25 13:26:21,433 - root - INFO - Training completed [rank0]:[titan] 2025-06-25 13:26:21,434 - root - INFO - Destroying the purge thread. [rank0]:[titan] 2025-06-25 13:26:28,980 - root - INFO - Process group destroyed. ``` ### TorchTitan Training (Async+ Pinned Mem) Without DCP for 1000 steps on LLAMA3 8B Model ``` [rank0]:[titan] 2025-06-16 22:26:05,616 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters [rank0]:[titan] 2025-06-16 22:26:05,617 - root - INFO - Applied selective activation checkpointing to the model [rank0]:[titan] 2025-06-16 22:26:05,680 - root - INFO - Applied FSDP to the model [rank0]:[titan] 2025-06-16 22:26:05,942 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14 [rank0]:[titan] 2025-06-16 22:26:05,943 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%) [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200). [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Training starts at step 1. [rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace [rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - step: 1 loss: 12.2552 memory: 42.16GiB(44.38%) tps: 1,512 tflops: 87.59 mfu: 8.86% [rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[titan] 2025-06-16 22:26:24,113 - root - INFO - step: 10 loss: 10.2261 memory: 49.66GiB(52.28%) tps: 5,638 tflops: 326.50 mfu: 33.01% [rank0]:[titan] 2025-06-16 22:26:37,933 - root - INFO - step: 20 loss: 8.5252 memory: 49.66GiB(52.28%) tps: 5,928 tflops: 343.32 mfu: 34.71% [rank0]:[titan] 2025-06-16 22:26:51,817 - root - INFO - step: 30 loss: 7.7284 memory: 49.66GiB(52.28%) tps: 5,902 tflops: 341.78 mfu: 34.56% [rank0]:[titan] 2025-06-16 22:27:05,726 - root - INFO - step: 40 loss: 7.3651 memory: 49.66GiB(52.28%) tps: 5,890 tflops: 341.14 mfu: 34.49% [rank0]:[titan] 2025-06-16 22:27:18,093 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-16 22:27:19,646 - root - INFO - step: 50 loss: 7.1575 memory: 49.66GiB(52.28%) tps: 5,886 tflops: 340.87 mfu: 34.47% [rank0]:[titan] 2025-06-16 22:27:33,572 - root - INFO - step: 60 loss: 6.8193 memory: 49.66GiB(52.28%) tps: 5,883 tflops: 340.72 mfu: 34.45% [rank0]:[titan] 2025-06-16 22:27:47,509 - root - INFO - step: 70 loss: 6.9294 memory: 49.66GiB(52.28%) tps: 5,879 tflops: 340.47 mfu: 34.43% [rank0]:[titan] 2025-06-16 22:28:01,442 - root - INFO - step: 80 loss: 6.6544 memory: 49.66GiB(52.28%) tps: 5,880 tflops: 340.56 mfu: 34.44% [rank0]:[titan] 2025-06-16 22:28:15,387 - root - INFO - step: 90 loss: 6.6736 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.27 mfu: 34.41% [rank0]:[titan] 2025-06-16 22:28:28,168 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:28:29,755 - root - INFO - step: 100 loss: 6.4444 memory: 49.66GiB(52.28%) tps: 5,702 tflops: 330.24 mfu: 33.39% [rank0]:[titan] 2025-06-16 22:28:30,372 - root - INFO - Dumping profiler traces at step 100 [rank0]:[titan] 2025-06-16 22:28:30,573 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:28:44,564 - root - INFO - step: 110 loss: 6.6835 memory: 49.66GiB(52.28%) tps: 5,532 tflops: 320.39 mfu: 32.40% [rank0]:[titan] 2025-06-16 22:28:58,531 - root - INFO - step: 120 loss: 6.3605 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:29:12,500 - root - INFO - step: 130 loss: 6.4930 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.69 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:29:26,476 - root - INFO - step: 140 loss: 6.4257 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.53 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:29:38,860 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:29:40,448 - root - INFO - step: 150 loss: 6.0684 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.61 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:29:54,415 - root - INFO - step: 160 loss: 6.3202 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.72 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:30:08,386 - root - INFO - step: 170 loss: 5.9970 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.64 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:30:22,357 - root - INFO - step: 180 loss: 6.0888 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.65 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:30:36,333 - root - INFO - step: 190 loss: 6.7146 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.52 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:30:48,735 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:30:50,322 - root - INFO - step: 200 loss: 6.1384 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:30:50,758 - root - INFO - Dumping profiler traces at step 200 [rank0]:[titan] 2025-06-16 22:30:50,976 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-16 22:31:04,970 - root - INFO - step: 210 loss: 6.0388 memory: 49.66GiB(52.28%) tps: 5,593 tflops: 323.92 mfu: 32.75% [rank0]:[titan] 2025-06-16 22:31:18,934 - root - INFO - step: 220 loss: 6.0348 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.82 mfu: 34.36% [rank0]:[titan] 2025-06-16 22:31:32,895 - root - INFO - step: 230 loss: 5.9670 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.88 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:31:46,869 - root - INFO - step: 240 loss: 5.8311 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.59 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:31:59,261 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:32:00,846 - root - INFO - step: 250 loss: 5.8383 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:32:15,834 - root - INFO - step: 260 loss: 5.7541 memory: 49.66GiB(52.28%) tps: 5,467 tflops: 316.60 mfu: 32.01% [rank0]:[titan] 2025-06-16 22:32:29,803 - root - INFO - step: 270 loss: 5.7202 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:32:43,772 - root - INFO - step: 280 loss: 5.6326 memory: 49.66GiB(52.28%) tps: 5,866 tflops: 339.71 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:32:57,734 - root - INFO - step: 290 loss: 5.7467 memory: 49.66GiB(52.28%) tps: 5,868 tflops: 339.85 mfu: 34.36% [rank0]:[titan] 2025-06-16 22:33:10,148 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-16 22:33:11,718 - root - INFO - step: 300 loss: 5.7704 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:33:12,173 - root - INFO - Dumping profiler traces at step 300 [rank0]:[titan] 2025-06-16 22:33:12,371 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:33:26,359 - root - INFO - step: 310 loss: 5.6014 memory: 49.66GiB(52.28%) tps: 5,596 tflops: 324.09 mfu: 32.77% [rank0]:[titan] 2025-06-16 22:33:40,337 - root - INFO - step: 320 loss: 5.5262 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:33:54,315 - root - INFO - step: 330 loss: 5.4930 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:34:08,294 - root - INFO - step: 340 loss: 5.5682 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:34:20,693 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:34:22,279 - root - INFO - step: 350 loss: 5.5599 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:34:36,250 - root - INFO - step: 360 loss: 5.5070 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.63 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:34:50,242 - root - INFO - step: 370 loss: 5.4143 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.11 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:35:04,230 - root - INFO - step: 380 loss: 5.4226 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:35:18,206 - root - INFO - step: 390 loss: 5.3208 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.55 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:35:30,661 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds. [rank0]:[titan] 2025-06-16 22:35:32,251 - root - INFO - step: 400 loss: 5.4494 memory: 49.66GiB(52.28%) tps: 5,833 tflops: 337.83 mfu: 34.16% [rank0]:[titan] 2025-06-16 22:35:32,714 - root - INFO - Dumping profiler traces at step 400 [rank0]:[titan] 2025-06-16 22:35:32,916 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:35:46,913 - root - INFO - step: 410 loss: 5.2551 memory: 49.66GiB(52.28%) tps: 5,588 tflops: 323.63 mfu: 32.72% [rank0]:[titan] 2025-06-16 22:36:00,884 - root - INFO - step: 420 loss: 5.4090 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.64 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:36:14,870 - root - INFO - step: 430 loss: 5.3344 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:36:28,864 - root - INFO - step: 440 loss: 5.3106 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.10 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:36:41,255 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:36:42,843 - root - INFO - step: 450 loss: 5.2298 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.47 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:36:56,829 - root - INFO - step: 460 loss: 5.2365 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.26 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:37:10,821 - root - INFO - step: 470 loss: 5.1352 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.15 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:37:24,810 - root - INFO - step: 480 loss: 5.1063 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.20 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:37:38,791 - root - INFO - step: 490 loss: 5.1559 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:37:51,229 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds. [rank0]:[titan] 2025-06-16 22:37:52,788 - root - INFO - step: 500 loss: 5.2053 memory: 49.66GiB(52.28%) tps: 5,853 tflops: 339.00 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:37:53,282 - root - INFO - Dumping profiler traces at step 500 [rank0]:[titan] 2025-06-16 22:37:53,479 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:38:07,509 - root - INFO - step: 510 loss: 5.0252 memory: 49.66GiB(52.28%) tps: 5,565 tflops: 322.31 mfu: 32.59% [rank0]:[titan] 2025-06-16 22:38:22,243 - root - INFO - step: 520 loss: 5.2369 memory: 49.66GiB(52.28%) tps: 5,561 tflops: 322.06 mfu: 32.56% [rank0]:[titan] 2025-06-16 22:38:36,217 - root - INFO - step: 530 loss: 5.1380 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.56 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:38:50,196 - root - INFO - step: 540 loss: 5.0538 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.44 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:39:02,608 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:39:04,186 - root - INFO - step: 550 loss: 5.0703 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:39:18,173 - root - INFO - step: 560 loss: 5.1104 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:39:32,166 - root - INFO - step: 570 loss: 4.9368 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.12 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:39:46,148 - root - INFO - step: 580 loss: 5.0037 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:40:00,139 - root - INFO - step: 590 loss: 5.0303 memory: 49.66GiB(52.28%) tps: 5,856 tflops: 339.16 mfu: 34.29% [rank0]:[titan] 2025-06-16 22:40:12,566 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:40:14,145 - root - INFO - step: 600 loss: 4.9233 memory: 49.66GiB(52.28%) tps: 5,850 tflops: 338.78 mfu: 34.25% [rank0]:[titan] 2025-06-16 22:40:14,671 - root - INFO - Dumping profiler traces at step 600 [rank0]:[titan] 2025-06-16 22:40:14,871 - root - INFO - Finished dumping profiler traces in 0.20 seconds [rank0]:[titan] 2025-06-16 22:40:28,831 - root - INFO - step: 610 loss: 5.0338 memory: 49.66GiB(52.28%) tps: 5,579 tflops: 323.10 mfu: 32.67% [rank0]:[titan] 2025-06-16 22:40:42,800 - root - INFO - step: 620 loss: 5.0437 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.68 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:40:56,779 - root - INFO - step: 630 loss: 4.8868 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:41:10,768 - root - INFO - step: 640 loss: 4.8542 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.19 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:41:23,176 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:41:24,756 - root - INFO - step: 650 loss: 5.0227 memory: 49.66GiB(52.28%) tps: 5,857 tflops: 339.23 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:41:38,741 - root - INFO - step: 660 loss: 4.8901 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.32 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:41:52,723 - root - INFO - step: 670 loss: 4.8374 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.36 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:06,707 - root - INFO - step: 680 loss: 4.9357 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:20,691 - root - INFO - step: 690 loss: 4.6904 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.33 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:42:33,106 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:42:34,686 - root - INFO - step: 700 loss: 4.8426 memory: 49.66GiB(52.28%) tps: 5,854 tflops: 339.02 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:42:35,202 - root - INFO - Dumping profiler traces at step 700 [rank0]:[titan] 2025-06-16 22:42:35,411 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:42:49,414 - root - INFO - step: 710 loss: 4.8630 memory: 49.66GiB(52.28%) tps: 5,563 tflops: 322.17 mfu: 32.58% [rank0]:[titan] 2025-06-16 22:43:03,389 - root - INFO - step: 720 loss: 4.8307 memory: 49.66GiB(52.28%) tps: 5,863 tflops: 339.54 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:43:17,366 - root - INFO - step: 730 loss: 5.0704 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.48 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:43:31,346 - root - INFO - step: 740 loss: 4.9245 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.40 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:43:43,762 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds. [rank0]:[titan] 2025-06-16 22:43:45,331 - root - INFO - step: 750 loss: 4.6614 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.31 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:43:59,308 - root - INFO - step: 760 loss: 4.7145 memory: 49.66GiB(52.28%) tps: 5,862 tflops: 339.50 mfu: 34.33% [rank0]:[titan] 2025-06-16 22:44:13,608 - root - INFO - step: 770 loss: 4.7226 memory: 49.66GiB(52.28%) tps: 5,730 tflops: 331.83 mfu: 33.55% [rank0]:[titan] 2025-06-16 22:44:28,628 - root - INFO - step: 780 loss: 4.7430 memory: 49.66GiB(52.28%) tps: 5,455 tflops: 315.93 mfu: 31.94% [rank0]:[titan] 2025-06-16 22:44:42,611 - root - INFO - step: 790 loss: 4.6875 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.35 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:44:55,024 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:44:56,605 - root - INFO - step: 800 loss: 4.6901 memory: 49.66GiB(52.28%) tps: 5,855 tflops: 339.06 mfu: 34.28% [rank0]:[titan] 2025-06-16 22:44:57,138 - root - INFO - Dumping profiler traces at step 800 [rank0]:[titan] 2025-06-16 22:44:57,344 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:45:11,360 - root - INFO - step: 810 loss: 4.6838 memory: 49.66GiB(52.28%) tps: 5,552 tflops: 321.56 mfu: 32.51% [rank0]:[titan] 2025-06-16 22:45:25,305 - root - INFO - step: 820 loss: 4.6000 memory: 49.66GiB(52.28%) tps: 5,876 tflops: 340.28 mfu: 34.41% [rank0]:[titan] 2025-06-16 22:45:39,251 - root - INFO - step: 830 loss: 4.6733 memory: 49.66GiB(52.28%) tps: 5,875 tflops: 340.25 mfu: 34.40% [rank0]:[titan] 2025-06-16 22:45:53,208 - root - INFO - step: 840 loss: 4.6178 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.98 mfu: 34.38% [rank0]:[titan] 2025-06-16 22:46:05,611 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:46:07,194 - root - INFO - step: 850 loss: 4.7753 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.27 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:46:21,172 - root - INFO - step: 860 loss: 4.5818 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.45 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:46:35,159 - root - INFO - step: 870 loss: 4.5332 memory: 49.66GiB(52.28%) tps: 5,858 tflops: 339.25 mfu: 34.30% [rank0]:[titan] 2025-06-16 22:46:49,143 - root - INFO - step: 880 loss: 4.4733 memory: 49.66GiB(52.28%) tps: 5,859 tflops: 339.34 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:47:03,125 - root - INFO - step: 890 loss: 4.6788 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.38 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:47:15,614 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:47:17,195 - root - INFO - step: 900 loss: 4.6943 memory: 49.66GiB(52.28%) tps: 5,823 tflops: 337.23 mfu: 34.10% [rank0]:[titan] 2025-06-16 22:47:17,789 - root - INFO - Dumping profiler traces at step 900 [rank0]:[titan] 2025-06-16 22:47:18,004 - root - INFO - Finished dumping profiler traces in 0.22 seconds [rank0]:[titan] 2025-06-16 22:47:31,972 - root - INFO - step: 910 loss: 4.4559 memory: 49.66GiB(52.28%) tps: 5,544 tflops: 321.10 mfu: 32.47% [rank0]:[titan] 2025-06-16 22:47:45,950 - root - INFO - step: 920 loss: 4.5920 memory: 49.66GiB(52.28%) tps: 5,861 tflops: 339.46 mfu: 34.32% [rank0]:[titan] 2025-06-16 22:47:59,923 - root - INFO - step: 930 loss: 4.6219 memory: 49.66GiB(52.28%) tps: 5,864 tflops: 339.58 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:48:13,889 - root - INFO - step: 940 loss: 4.5429 memory: 49.66GiB(52.28%) tps: 5,867 tflops: 339.76 mfu: 34.35% [rank0]:[titan] 2025-06-16 22:48:26,271 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds. [rank0]:[titan] 2025-06-16 22:48:27,849 - root - INFO - step: 950 loss: 4.6160 memory: 49.66GiB(52.28%) tps: 5,870 tflops: 339.93 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:48:41,809 - root - INFO - step: 960 loss: 4.5495 memory: 49.66GiB(52.28%) tps: 5,869 tflops: 339.91 mfu: 34.37% [rank0]:[titan] 2025-06-16 22:48:55,764 - root - INFO - step: 970 loss: 4.6507 memory: 49.66GiB(52.28%) tps: 5,871 tflops: 340.03 mfu: 34.38% [rank0]:[titan] 2025-06-16 22:49:09,734 - root - INFO - step: 980 loss: 4.5222 memory: 49.66GiB(52.28%) tps: 5,865 tflops: 339.66 mfu: 34.34% [rank0]:[titan] 2025-06-16 22:49:23,717 - root - INFO - step: 990 loss: 4.6672 memory: 49.66GiB(52.28%) tps: 5,860 tflops: 339.37 mfu: 34.31% [rank0]:[titan] 2025-06-16 22:49:36,204 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds. [rank0]:[titan] 2025-06-16 22:49:37,745 - root - INFO - step: 1000 loss: 4.6165 memory: 49.66GiB(52.28%) tps: 5,840 tflops: 338.24 mfu: 34.20% [rank0]:[titan] 2025-06-16 22:49:38,344 - root - INFO - Dumping profiler traces at step 1000 [rank0]:[titan] 2025-06-16 22:49:38,552 - root - INFO - Finished dumping profiler traces in 0.21 seconds [rank0]:[titan] 2025-06-16 22:49:38,553 - root - INFO - Sleeping 2 seconds for other ranks to complete [rank0]:[titan] 2025-06-16 22:49:40,553 - root - INFO - Training completed [rank0]:[titan] 2025-06-16 22:49:41,030 - root - INFO - Process group destroyed. ``` --------- Co-authored-by: saiteja64 <saiteja64@devvm8023.cco0.facebook.com>
1 parent 7104125 commit 671c30c

File tree

2 files changed

+46
-128
lines changed

2 files changed

+46
-128
lines changed

tests/unit_tests/test_checkpoint.py

Lines changed: 11 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -420,15 +420,15 @@ def test_async_save_calls_async_wait(self, mock_async_save, mock_new_group):
420420

421421
# First save schedules async
422422
manager.save(curr_step=10, last_step=False)
423-
future = manager.async_future
423+
future = manager.save_future
424424
future.result.assert_not_called()
425425

426426
# Second save should wait
427427
manager.save(curr_step=20, last_step=False)
428428
future.result.assert_called_once()
429429

430430
# New future created
431-
new_future = manager.async_future
431+
new_future = manager.save_future
432432
new_future.result.assert_not_called()
433433

434434
@mock.patch("torch.cuda.Stream")
@@ -446,8 +446,10 @@ def test_ft_async_save_calls_async_wait(
446446
Test that with FT enabled, AsyncMode.ASYNC via FT triggers correct waits.
447447
"""
448448
job_config = DummyJobConfig(job=self.job_config.job)
449-
job_config.checkpoint.async_mode = "disabled"
449+
job_config.checkpoint.async_mode = "async"
450450
ft_manager = mock.Mock()
451+
ft_manager.manager.return_value = mock.Mock()
452+
ft_manager.manager.participating_rank = mock.Mock(return_value=0)
451453
ft_manager.enabled = True
452454
states = {"trainer": torch.tensor([0])}
453455
manager = CheckpointManager(
@@ -461,16 +463,16 @@ def test_ft_async_save_calls_async_wait(
461463
)
462464

463465
# Initially no future
464-
self.assertIsNone(manager.async_future)
466+
self.assertIsNone(manager.save_future)
465467
manager.save(curr_step=5, last_step=False)
466-
self.assertIsNotNone(manager.async_future)
468+
self.assertIsNotNone(manager.save_future)
467469

468-
manager.async_future.result.assert_not_called()
469-
prev_future = manager.async_future
470+
manager.save_future.result.assert_not_called()
471+
prev_future = manager.save_future
470472
manager.save(curr_step=6, last_step=False)
471473
prev_future.result.assert_called_once()
472-
self.assertIsNotNone(manager.async_future)
473-
manager.async_future.result.assert_not_called()
474+
self.assertIsNotNone(manager.save_future)
475+
manager.save_future.result.assert_not_called()
474476

475477
@mock.patch("torch.distributed.get_rank", return_value=0)
476478
@mock.patch("torchtitan.components.checkpoint.dcp.save")

torchtitan/components/checkpoint.py

Lines changed: 35 additions & 119 deletions
Original file line numberDiff line numberDiff line change
@@ -17,22 +17,22 @@
1717
import torch
1818
import torch.distributed as dist
1919
import torch.distributed.checkpoint as dcp
20-
import torch.multiprocessing as mp
2120
import torch.nn as nn
22-
from torch.distributed._state_dict_utils import _copy_state_dict, _create_cpu_state_dict
21+
from torch.distributed.checkpoint.staging import DefaultStager, StagingOptions
2322
from torch.distributed.checkpoint.state_dict import (
2423
get_model_state_dict,
2524
set_model_state_dict,
2625
StateDictOptions,
2726
)
27+
from torch.distributed.checkpoint.state_dict_saver import AsyncCheckpointerType
2828
from torch.distributed.checkpoint.stateful import Stateful
2929
from torch.utils.data import DataLoader
3030

3131
from torchtitan.components.ft import FTManager
3232
from torchtitan.components.lr_scheduler import LRSchedulersContainer
3333
from torchtitan.components.optimizer import OptimizersContainer
3434
from torchtitan.config_manager import JobConfig, TORCH_DTYPE_MAP
35-
from torchtitan.tools.logging import init_logger, logger
35+
from torchtitan.tools.logging import logger
3636
from torchtitan.tools.utils import GarbageCollection
3737

3838

@@ -98,43 +98,6 @@ def save_with_gc(state, checkpoint_id):
9898
GarbageCollection.collect("GC collection invoked by checkpointer.")
9999

100100

101-
def checkpoint_mp(recv: mp.Queue, send: mp.Queue):
102-
"""Process to save the checkpoint in the background.
103-
104-
This is only used when async_checkpoint_with_pinned_memory is enabled.
105-
106-
Args:
107-
recv (mp.Queue): The queue to receive the state_dict and Terminate signal.
108-
send (mp.Queue): The queue to send the SaveDone signal.
109-
"""
110-
init_logger()
111-
os.environ["MASTER_PORT"] = str(int(os.environ["MASTER_PORT"]) + 2)
112-
os.environ["TORCHELASTIC_USE_AGENT_STORE"] = "False"
113-
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
114-
dist.init_process_group()
115-
try:
116-
while True:
117-
logger.debug("Checkpoint background process is done.")
118-
send.put(SaveDone())
119-
logger.debug("Wait for the new state_dict.")
120-
obj = recv.get()
121-
logger.debug("Received the new state_dict.")
122-
if isinstance(obj, Terminate):
123-
logger.info("Terminating the checkpoint background process.")
124-
return
125-
assert isinstance(obj, tuple)
126-
begin = time.monotonic()
127-
state, checkpoint_id = obj
128-
save_with_gc(state, checkpoint_id=checkpoint_id)
129-
logger.info(
130-
"Finish saving the checkpoint in the background process in %.2f seconds.",
131-
time.monotonic() - begin,
132-
)
133-
finally:
134-
logger.info("Destroying the process group.")
135-
dist.destroy_process_group()
136-
137-
138101
def purge_thread(purge_queue: queue.Queue):
139102
"""Thread to purge the old checkpoints.
140103
@@ -275,7 +238,7 @@ def load_state_dict(state_dict):
275238
self.sending_to_checkpoint_mp = False
276239
self.staging_id = None
277240
self.cpu_offload_state_dict = None
278-
self.staging_stream = torch.cuda.Stream() if self.enable_staging else None
241+
self.stager = None
279242

280243
self.folder = os.path.join(job_config.job.dump_folder, ckpt_config.folder)
281244

@@ -292,7 +255,11 @@ def load_state_dict(state_dict):
292255

293256
# Async checkpoint related fields.
294257
async_mode = ckpt_config.async_mode.lower()
295-
if async_mode == AsyncMode.ASYNC or self.ft_manager:
258+
if (
259+
async_mode == AsyncMode.ASYNC
260+
or async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM
261+
or self.ft_manager
262+
):
296263
self.pg = dist.new_group(backend="gloo")
297264

298265
self.keep_latest_k = ckpt_config.keep_latest_k
@@ -311,25 +278,14 @@ def load_state_dict(state_dict):
311278
self.purge_thread = None
312279

313280
self.mp = None
314-
self.async_future = None
281+
self.staging_future = None
282+
self.save_future = None
315283
if async_mode == AsyncMode.DISABLED:
316284
self.async_mode = AsyncMode.DISABLED
317285
elif async_mode == AsyncMode.ASYNC:
318286
self.async_mode = AsyncMode.ASYNC
319287
elif async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
320288
self.async_mode = AsyncMode.ASYNC_WITH_PINNED_MEM
321-
ctx = mp.get_context("spawn")
322-
self.mp_queue_send = ctx.Queue()
323-
self.mp_queue_recv = ctx.Queue()
324-
self.mp = ctx.Process(
325-
target=checkpoint_mp,
326-
args=(
327-
self.mp_queue_send,
328-
self.mp_queue_recv,
329-
),
330-
daemon=True,
331-
)
332-
self.mp.start()
333289
else:
334290
raise ValueError(f"Unkown checkpoint async_mode {ckpt_config.async_mode}")
335291

@@ -353,6 +309,9 @@ def close(self):
353309
self.purge_queue.put(Terminate())
354310
self.purge_thread.join()
355311

312+
if self.stager is not None:
313+
self.stager.close()
314+
356315
@torch.no_grad()
357316
def save(self, curr_step: int, last_step: bool = False) -> None:
358317
"""Save the checkpoint for the current step.
@@ -388,10 +347,20 @@ def save(self, curr_step: int, last_step: bool = False) -> None:
388347
self._save_last_step(curr_step)
389348
elif self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
390349
GarbageCollection.collect("GC collection invoked by checkpointer.")
391-
self._async_with_pinned_memory(checkpoint_id)
350+
if self.stager is None:
351+
self.stager = DefaultStager(StagingOptions(True, True, True, True))
352+
result = dcp.async_save(
353+
self.states,
354+
checkpoint_id=checkpoint_id,
355+
process_group=self.pg,
356+
async_checkpointer_type=AsyncCheckpointerType.PROCESS,
357+
async_stager=self.stager,
358+
)
359+
self.save_future = result.upload_completion
360+
self.staging_future = result.staging_completion
392361
elif self.async_mode == AsyncMode.ASYNC:
393362
GarbageCollection.collect("GC collection invoked by checkpointer.")
394-
self.async_future = dcp.async_save(
363+
self.save_future = dcp.async_save(
395364
self.states, checkpoint_id=checkpoint_id, process_group=self.pg
396365
)
397366
GarbageCollection.collect("GC collection invoked by checkpointer.")
@@ -475,33 +444,7 @@ def maybe_wait_for_staging(self) -> None:
475444
with ``async_checkpoint_with_pinned_memory``.
476445
"""
477446
if self.enable_staging and self.staging:
478-
if not self.staging_stream.query():
479-
begin = time.monotonic()
480-
self.staging_stream.synchronize()
481-
logger.info(
482-
"Checkpointer waited staging %.2f seconds.",
483-
time.monotonic() - begin,
484-
)
485-
self.staging = False
486-
487-
if self.sending_to_checkpoint_mp:
488-
# Copy the sync staging result to another process.
489-
def sync_func():
490-
self.mp_queue_send.put_nowait(
491-
(self.cpu_offload_state_dict, self.staging_id)
492-
)
493-
494-
# This may be a faster way to do zero-overhead checkpointing staging
495-
# checkpointing but we need more thorough investigation before
496-
# swithing to this method.
497-
# self.my_thread = threading.Thread(target=func).start()
498-
begin = time.monotonic()
499-
sync_func()
500-
logger.info(
501-
"Checkpointer sent staged state_dict to another process %.2f seconds",
502-
time.monotonic() - begin,
503-
)
504-
self.sending_to_checkpoint_mp = False
447+
self.staging_future.result()
505448

506449
def _find_load_step(self, folder: str = "") -> int:
507450
"""Find the step to load the checkpoint for.
@@ -540,7 +483,7 @@ def _ft_save(self, step: int) -> None:
540483
begin = time.monotonic()
541484
self._async_wait()
542485
checkpoint_id = self._create_checkpoint_id(step, folder=self._ft_folder())
543-
self.async_future = dcp.async_save(
486+
self.save_future = dcp.async_save(
544487
self.ft_states, checkpoint_id=checkpoint_id, process_group=self.pg
545488
)
546489
logger.info(f"Staging ft checkpoint took {time.monotonic() - begin} secs.")
@@ -633,45 +576,18 @@ def _should_save(self, curr_step: int, last_step: bool = False) -> bool:
633576

634577
def _async_wait(self) -> None:
635578
if self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
636-
logger.debug(
637-
f"Waiting for the background process to finish, {time.monotonic()=}.:.2f"
638-
)
639-
if not self.mp.is_alive():
640-
raise RuntimeError("The checkpoint background process is dead.")
641-
_ = self.mp_queue_recv.get()
579+
if self.save_future is not None:
580+
self.save_future.result()
642581
elif self.async_mode == AsyncMode.ASYNC or self.ft_manager is not None:
643-
if self.async_future is not None:
644-
self.async_future.result()
645-
self.async_future = None
646-
elif self.async_future is not None:
582+
if self.save_future is not None:
583+
self.save_future.result()
584+
self.save_future = None
585+
elif self.save_future is not None:
647586
raise RuntimeError(
648-
"self.async_future is not None, but self.async_mode is not enabled "
587+
"self.save_future is not None, but self.async_mode is not enabled "
649588
"and fault tolerance is not active."
650589
)
651590

652-
def _async_with_pinned_memory(self, checkpoint_id: str) -> None:
653-
self._cpu_staging(checkpoint_id)
654-
self.sending_to_checkpoint_mp = True
655-
656-
def _cpu_staging(self, checkpoint_id: str | None) -> None:
657-
"""Offload state_dict to CPU memory"""
658-
state_dict = dcp.state_dict_saver._stateful_to_state_dict(self.states)
659-
if self.cpu_offload_state_dict is None:
660-
logger.debug(f"Preparing the CPU memory, {time.monotonic()=}.:.2f")
661-
self.cpu_offload_state_dict = _create_cpu_state_dict(
662-
state_dict, pin_memory=True, share_memory=True
663-
)
664-
665-
logger.debug(f"Staging the state_dict, {time.monotonic()=}.:.2f")
666-
with torch.cuda.stream(self.staging_stream):
667-
self.cpu_offload_state_dict = _copy_state_dict(
668-
state_dict,
669-
self.cpu_offload_state_dict,
670-
non_blocking=True,
671-
)
672-
self.staging = True
673-
self.staging_id = checkpoint_id
674-
675591
def _purge_stale_checkpoints(self):
676592
if (
677593
self.keep_latest_k > 0

0 commit comments

Comments
 (0)