Titan changes to use DCP ZOC instead of titan default Async + Pinned Memory (#1287)

Saiteja64 · saiteja64 · web-flow · commit c31ff8b0532c · 2025-07-01T14:49:48.000-07:00
Test Titan changes to use DCP ZOC instead of titan default

### Loss Curve DCP + ZOC. Traing with DCP ZOC Until Step 500. Delete
Last Checkpoint (since it's sync save). Then Run Load and Run Training
Until Step 1000
&lt;img width="1126" alt="Screenshot 2025-06-25 at 3 59 54 PM"
src="https://github.com/user-attachments/assets/d5c7986d-9a03-402f-a664-239807285117"
/&gt;

### Loss Curve TorchTitan Async + Pinned Memory Without DCP ZOC
&lt;img width="936" alt="Screenshot 2025-06-25 at 4 05 50 PM"
src="https://github.com/user-attachments/assets/0f616460-076b-4ef9-ba53-e0e285c030e7"
/&gt;

### DCP + Titan Training on LLAMA3 8B Model For 500 Steps
```
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
+ NGPU=8
+ export LOG_RANK=0
+ LOG_RANK=0
+ CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] 
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] *****************************************
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0625 12:54:54.267000 3608822 torch/distributed/run.py:774] *****************************************
[rank0]:[titan] 2025-06-25 12:55:00,591 - root - INFO - Starting job: Llama 3 8B training
[rank0]:[titan] 2025-06-25 12:55:02,714 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-06-25 12:55:02,716 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
[rank0]:[titan] 2025-06-25 12:55:02,717 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
[rank0]:NCCL version 2.27.3+cuda12.4
[rank0]:[rank0]:[W625 12:55:10.376686804 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 12:55:11,222 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:[titan] 2025-06-25 12:55:11,223 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:[titan] 2025-06-25 12:55:17,789 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001)
[rank0]:[titan] 2025-06-25 12:55:17,927 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1255
[rank0]:[titan] 2025-06-25 12:55:17,928 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-06-25 12:55:17,943 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-25 12:55:17,944 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-25 12:55:18,010 - root - INFO - Applied FSDP to the model
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-25 12:55:18,276 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:[rank0]:[W625 12:55:18.048917586 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 500 (warmup 200).
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-25 12:55:18,304 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:[titan] 2025-06-25 12:55:24,823 - root - INFO - step:  1  loss: 12.2478  memory: 42.16GiB(44.38%)  tps: 1,191  tflops: 68.97  mfu: 6.97%
[rank0]:[titan] 2025-06-25 12:55:24,824 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 12:55:24,842 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds.
[rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.04 seconds.
[rank0]:[titan] 2025-06-25 12:55:24,859 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[0] 2025-06-25 12:55:32,097 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[W625 12:55:32.282661499 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Checkpoint background process is running...
[rank0]:[0] 2025-06-25 12:55:32,539 - root - INFO - Waiting for checkpoint save request...
[rank0]:[0] 2025-06-25 12:55:37,346 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-1
[rank0]:[titan] 2025-06-25 12:55:55,589 - root - INFO - step: 10  loss:  9.9493  memory: 49.66GiB(52.28%)  tps: 2,397  tflops: 138.80  mfu: 14.03%
[rank0]:[titan] 2025-06-25 12:56:09,436 - root - INFO - step: 20  loss:  8.5802  memory: 49.66GiB(52.28%)  tps: 5,917  tflops: 342.67  mfu: 34.65%
[rank0]:[titan] 2025-06-25 12:56:23,343 - root - INFO - step: 30  loss:  7.7218  memory: 49.66GiB(52.28%)  tps: 5,892  tflops: 341.21  mfu: 34.50%
[rank0]:[titan] 2025-06-25 12:56:37,275 - root - INFO - step: 40  loss:  7.3309  memory: 49.66GiB(52.28%)  tps: 5,881  tflops: 340.60  mfu: 34.44%
[rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-1', uuid='320c8c66-1115-416f-a22a-50d86e094b3a')
[rank0]:[0] 2025-06-25 12:56:46,074 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 12:56:49,706 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds.
[rank0]:[titan] 2025-06-25 12:56:51,223 - root - INFO - step: 50  loss:  7.1024  memory: 49.66GiB(52.28%)  tps: 5,874  tflops: 340.20  mfu: 34.40%
[rank0]:[titan] 2025-06-25 12:57:05,194 - root - INFO - step: 60  loss:  6.8512  memory: 49.66GiB(52.28%)  tps: 5,864  tflops: 339.61  mfu: 34.34%
[rank0]:[titan] 2025-06-25 12:57:19,180 - root - INFO - step: 70  loss:  6.9118  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.30  mfu: 34.31%
[rank0]:[titan] 2025-06-25 12:57:33,158 - root - INFO - step: 80  loss:  6.6411  memory: 49.66GiB(52.28%)  tps: 5,861  tflops: 339.46  mfu: 34.32%
[rank0]:[titan] 2025-06-25 12:57:47,135 - root - INFO - step: 90  loss:  6.6820  memory: 49.66GiB(52.28%)  tps: 5,862  tflops: 339.50  mfu: 34.33%
[rank0]:[titan] 2025-06-25 12:58:00,303 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - step: 100  loss:  6.4539  memory: 49.66GiB(52.28%)  tps: 5,555  tflops: 321.72  mfu: 32.53%
[rank0]:[titan] 2025-06-25 12:58:01,883 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 12:58:01,884 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 12:58:01,904 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 12:58:02,647 - root - INFO - Dumping profiler traces at step 100
[rank0]:[titan] 2025-06-25 12:58:02,863 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[0] 2025-06-25 12:58:03,725 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-100
[rank0]:[titan] 2025-06-25 12:58:17,301 - root - INFO - step: 110  loss:  6.6926  memory: 49.66GiB(52.28%)  tps: 5,314  tflops: 307.75  mfu: 31.12%
[rank0]:[titan] 2025-06-25 12:58:31,306 - root - INFO - step: 120  loss:  6.3697  memory: 49.66GiB(52.28%)  tps: 5,850  tflops: 338.81  mfu: 34.26%
[rank0]:[titan] 2025-06-25 12:58:45,299 - root - INFO - step: 130  loss:  6.4435  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.10  mfu: 34.29%
[rank0]:[titan] 2025-06-25 12:58:59,293 - root - INFO - step: 140  loss:  6.3808  memory: 49.66GiB(52.28%)  tps: 5,854  tflops: 339.05  mfu: 34.28%
[rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-100', uuid='324a8712-2df4-4b32-ae9f-eb26516ce3ac')
[rank0]:[0] 2025-06-25 12:58:59,562 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 12:59:11,719 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 12:59:13,294 - root - INFO - step: 150  loss:  6.0557  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.92  mfu: 34.27%
[rank0]:[titan] 2025-06-25 12:59:27,291 - root - INFO - step: 160  loss:  6.3298  memory: 49.66GiB(52.28%)  tps: 5,854  tflops: 339.01  mfu: 34.28%
[rank0]:[titan] 2025-06-25 12:59:41,291 - root - INFO - step: 170  loss:  6.0259  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.92  mfu: 34.27%
[rank0]:[titan] 2025-06-25 12:59:55,283 - root - INFO - step: 180  loss:  6.0728  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.13  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:00:09,269 - root - INFO - step: 190  loss:  6.7391  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.26  mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:00:21,734 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds.
[rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - step: 200  loss:  6.1409  memory: 49.66GiB(52.28%)  tps: 5,850  tflops: 338.79  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:00:23,274 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:00:23,275 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:00:23,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:00:23,801 - root - INFO - Dumping profiler traces at step 200
[rank0]:[titan] 2025-06-25 13:00:24,009 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[0] 2025-06-25 13:00:24,833 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-200
[rank0]:[titan] 2025-06-25 13:00:38,511 - root - INFO - step: 210  loss:  6.0412  memory: 49.66GiB(52.28%)  tps: 5,377  tflops: 311.41  mfu: 31.49%
[rank0]:[titan] 2025-06-25 13:00:52,498 - root - INFO - step: 220  loss:  6.0333  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.22  mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:01:06,488 - root - INFO - step: 230  loss:  5.9720  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.18  mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:01:20,477 - root - INFO - step: 240  loss:  5.8235  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.21  mfu: 34.30%
[rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-200', uuid='9a1e278a-8372-4bf5-b7f7-a7fd2b80cc5e')
[rank0]:[0] 2025-06-25 13:01:21,487 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:01:32,911 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds.
[rank0]:[titan] 2025-06-25 13:01:34,464 - root - INFO - step: 250  loss:  5.8334  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.25  mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:01:49,456 - root - INFO - step: 260  loss:  5.7444  memory: 49.66GiB(52.28%)  tps: 5,465  tflops: 316.51  mfu: 32.00%
[rank0]:[titan] 2025-06-25 13:02:03,466 - root - INFO - step: 270  loss:  5.7170  memory: 49.66GiB(52.28%)  tps: 5,848  tflops: 338.69  mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:02:17,468 - root - INFO - step: 280  loss:  5.6201  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.87  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:02:31,468 - root - INFO - step: 290  loss:  5.7282  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.93  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:02:43,898 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:02:45,474 - root - INFO - step: 300  loss:  5.7546  memory: 49.66GiB(52.28%)  tps: 5,849  tflops: 338.76  mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:02:45,475 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:02:45,476 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:02:45,496 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:02:46,012 - root - INFO - Dumping profiler traces at step 300
[rank0]:[titan] 2025-06-25 13:02:46,220 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[0] 2025-06-25 13:02:47,106 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-300
[rank0]:[titan] 2025-06-25 13:03:00,674 - root - INFO - step: 310  loss:  5.5701  memory: 49.66GiB(52.28%)  tps: 5,390  tflops: 312.17  mfu: 31.56%
[rank0]:[titan] 2025-06-25 13:03:14,668 - root - INFO - step: 320  loss:  5.5060  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.09  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:03:28,667 - root - INFO - step: 330  loss:  5.4723  memory: 49.66GiB(52.28%)  tps: 5,853  tflops: 338.96  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:03:42,657 - root - INFO - step: 340  loss:  5.5342  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.18  mfu: 34.29%
[rank0]:[0] 2025-06-25 13:03:45,717 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-300', uuid='c7b82f66-2746-49c4-b626-f527beec83cb')
[rank0]:[0] 2025-06-25 13:03:45,718 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:03:55,085 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-25 13:03:56,649 - root - INFO - step: 350  loss:  5.5262  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.11  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:04:10,642 - root - INFO - step: 360  loss:  5.4645  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.11  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:04:24,639 - root - INFO - step: 370  loss:  5.3705  memory: 49.66GiB(52.28%)  tps: 5,854  tflops: 339.02  mfu: 34.28%
[rank0]:[titan] 2025-06-25 13:04:38,629 - root - INFO - step: 380  loss:  5.3711  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.18  mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:04:52,611 - root - INFO - step: 390  loss:  5.2803  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.37  mfu: 34.31%
[rank0]:[titan] 2025-06-25 13:05:05,085 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds.
[rank0]:[titan] 2025-06-25 13:05:06,613 - root - INFO - step: 400  loss:  5.4034  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.86  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:05:06,614 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:05:06,615 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:05:06,642 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.03 seconds.
[rank0]:[titan] 2025-06-25 13:05:07,153 - root - INFO - Dumping profiler traces at step 400
[rank0]:[titan] 2025-06-25 13:05:07,351 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:05:08,248 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-400
[rank0]:[titan] 2025-06-25 13:05:21,858 - root - INFO - step: 410  loss:  5.2031  memory: 49.66GiB(52.28%)  tps: 5,374  tflops: 311.24  mfu: 31.47%
[rank0]:[titan] 2025-06-25 13:05:35,829 - root - INFO - step: 420  loss:  5.3476  memory: 49.66GiB(52.28%)  tps: 5,865  tflops: 339.65  mfu: 34.34%
[rank0]:[titan] 2025-06-25 13:05:49,793 - root - INFO - step: 430  loss:  5.2579  memory: 49.66GiB(52.28%)  tps: 5,867  tflops: 339.80  mfu: 34.36%
[rank0]:[titan] 2025-06-25 13:06:03,755 - root - INFO - step: 440  loss:  5.2553  memory: 49.66GiB(52.28%)  tps: 5,868  tflops: 339.85  mfu: 34.36%
[rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-400', uuid='22d8ae0c-f9c7-4424-ac4b-ab4fdaa4655f')
[rank0]:[0] 2025-06-25 13:06:06,800 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:06:16,214 - root - INFO - [GC] Peforming periodical GC collection. 0.12 seconds.
[rank0]:[titan] 2025-06-25 13:06:17,743 - root - INFO - step: 450  loss:  5.1627  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.25  mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:06:31,746 - root - INFO - step: 460  loss:  5.1879  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.86  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:06:45,740 - root - INFO - step: 470  loss:  5.0814  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.07  mfu: 34.28%
[rank0]:[titan] 2025-06-25 13:06:59,746 - root - INFO - step: 480  loss:  5.0594  memory: 49.66GiB(52.28%)  tps: 5,850  tflops: 338.80  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:07:13,748 - root - INFO - step: 490  loss:  5.1015  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.88  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:07:26,234 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 13:07:27,814 - root - INFO - step: 500  loss:  5.1793  memory: 49.66GiB(52.28%)  tps: 5,824  tflops: 337.30  mfu: 34.11%
[rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:07:27,815 - root - INFO - Saving a full checkpoint at last step, step 500.
[rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:08:17,506 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.69 seconds.
[rank0]:[titan] 2025-06-25 13:08:18,133 - root - INFO - Dumping profiler traces at step 500
[rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Finished dumping profiler traces in 0.24 seconds
[rank0]:[titan] 2025-06-25 13:08:18,369 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-25 13:08:20,369 - root - INFO - Training completed
[rank0]:[titan] 2025-06-25 13:08:20,370 - root - INFO - Destroying the purge thread.
[rank0]:[titan] 2025-06-25 13:08:24,141 - root - INFO - Process group destroyed.
```
### Delete Checkpoint at 500th Step (Since this was sync saved). Run
LLAMA3 8B model training to 1000 steps
```
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
+ NGPU=8
+ export LOG_RANK=0
+ LOG_RANK=0
+ CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml
+ overrides=
+ '[' 0 -ne 0 ']'
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/llama3_8b.toml
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] 
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] *****************************************
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0625 13:10:21.529000 4085013 torch/distributed/run.py:774] *****************************************
[rank0]:[titan] 2025-06-25 13:10:27,930 - root - INFO - Starting job: Llama 3 8B training
[rank0]:[titan] 2025-06-25 13:10:32,443 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-06-25 13:10:32,445 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
[rank0]:[titan] 2025-06-25 13:10:32,447 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
[rank0]:NCCL version 2.27.3+cuda12.4
[rank0]:[rank0]:[W625 13:10:37.522041468 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:[titan] 2025-06-25 13:10:38,277 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:[titan] 2025-06-25 13:10:43,788 - root - INFO - Building llama3 8B with TransformerModelArgs(_enforced='This field is used to enforce all fields have defaults.', dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, use_flex_attn=False, attn_mask_type='causal', eos_id=128001)
[rank0]:[titan] 2025-06-25 13:10:43,930 - root - INFO - TensorBoard logging enabled. Logs will be saved at ./outputs/tb/20250625-1310
[rank0]:[titan] 2025-06-25 13:10:43,931 - root - INFO - CUDA capacity: NVIDIA H100 with 94.99GiB memory
[rank0]:[titan] 2025-06-25 13:10:43,946 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-25 13:10:43,947 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-25 13:10:44,010 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-25 13:10:44,268 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[rank0]:[W625 13:10:45.257529860 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200).
[rank0]:[titan] 2025-06-25 13:10:45,513 - root - INFO - Loading the checkpoint at step 400.
[rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - [GC] GC collection for checkpoint loading. 0.03 seconds.
[rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Finished loading the checkpoint in 11.31 seconds.
[rank0]:[titan] 2025-06-25 13:10:56,821 - root - INFO - Training starts at step 401.
[rank0]:[titan] 2025-06-25 13:10:56,822 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:[titan] 2025-06-25 13:11:13,774 - root - INFO - step: 410  loss:  5.2593  memory: 49.66GiB(52.28%)  tps: 2,746  tflops: 159.06  mfu: 16.08%
[rank0]:[titan] 2025-06-25 13:11:27,654 - root - INFO - step: 420  loss:  5.4461  memory: 49.66GiB(52.28%)  tps: 5,903  tflops: 341.86  mfu: 34.57%
[rank0]:[titan] 2025-06-25 13:11:41,595 - root - INFO - step: 430  loss:  5.3764  memory: 49.66GiB(52.28%)  tps: 5,877  tflops: 340.34  mfu: 34.41%
[rank0]:[titan] 2025-06-25 13:11:55,562 - root - INFO - step: 440  loss:  5.3640  memory: 49.66GiB(52.28%)  tps: 5,866  tflops: 339.74  mfu: 34.35%
[rank0]:[titan] 2025-06-25 13:12:07,987 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-25 13:12:09,550 - root - INFO - step: 450  loss:  5.2764  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.23  mfu: 34.30%
[rank0]:[titan] 2025-06-25 13:12:23,550 - root - INFO - step: 460  loss:  5.2871  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.91  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:12:37,552 - root - INFO - step: 470  loss:  5.1846  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.90  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:12:51,565 - root - INFO - step: 480  loss:  5.1532  memory: 49.66GiB(52.28%)  tps: 5,847  tflops: 338.61  mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:13:05,570 - root - INFO - step: 490  loss:  5.2000  memory: 49.66GiB(52.28%)  tps: 5,850  tflops: 338.82  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:13:19,252 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:13:20,824 - root - INFO - step: 500  loss:  5.2609  memory: 49.66GiB(52.28%)  tps: 5,371  tflops: 311.03  mfu: 31.45%
[rank0]:[titan] 2025-06-25 13:13:20,825 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:13:20,826 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:13:20,844 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:13:21,785 - root - INFO - Dumping profiler traces at step 500
[rank0]:[titan] 2025-06-25 13:13:22,009 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[0] 2025-06-25 13:13:30,315 - root - INFO - Initializing dist.ProcessGroup in checkpoint background process
[rank0]:[W625 13:13:30.320876720 Utils.hpp:137] Warning: Environment variable TORCH_NCCL_TRACE_BUFFER_SIZE is deprecated; use TORCH_FR_BUFFER_SIZE instead (function operator())
[rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Checkpoint background process is running...
[rank0]:[0] 2025-06-25 13:13:30,577 - root - INFO - Waiting for checkpoint save request...
[rank0]:[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[rank0]:[0] 2025-06-25 13:13:35,906 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-500
[rank0]:[titan] 2025-06-25 13:13:55,786 - root - INFO - step: 510  loss:  5.0734  memory: 49.66GiB(52.28%)  tps: 2,343  tflops: 135.71  mfu: 13.72%
[rank0]:[titan] 2025-06-25 13:14:09,750 - root - INFO - step: 520  loss:  5.2781  memory: 49.66GiB(52.28%)  tps: 5,867  tflops: 339.80  mfu: 34.36%
[rank0]:[titan] 2025-06-25 13:14:23,740 - root - INFO - step: 530  loss:  5.1984  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.15  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:14:37,748 - root - INFO - step: 540  loss:  5.0945  memory: 49.66GiB(52.28%)  tps: 5,849  tflops: 338.73  mfu: 34.25%
[rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-500', uuid='851c2b13-e42f-4630-9a50-edb23b6a52ac')
[rank0]:[0] 2025-06-25 13:14:46,826 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:14:50,219 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds.
[rank0]:[titan] 2025-06-25 13:14:51,767 - root - INFO - step: 550  loss:  5.1208  memory: 49.66GiB(52.28%)  tps: 5,844  tflops: 338.47  mfu: 34.22%
[rank0]:[titan] 2025-06-25 13:15:05,767 - root - INFO - step: 560  loss:  5.1587  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.93  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:15:19,775 - root - INFO - step: 570  loss:  4.9740  memory: 49.66GiB(52.28%)  tps: 5,849  tflops: 338.73  mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:15:33,792 - root - INFO - step: 580  loss:  5.0386  memory: 49.66GiB(52.28%)  tps: 5,845  tflops: 338.50  mfu: 34.23%
[rank0]:[titan] 2025-06-25 13:15:47,811 - root - INFO - step: 590  loss:  5.0741  memory: 49.66GiB(52.28%)  tps: 5,845  tflops: 338.48  mfu: 34.22%
[rank0]:[titan] 2025-06-25 13:16:00,259 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - step: 600  loss:  4.9624  memory: 49.66GiB(52.28%)  tps: 5,844  tflops: 338.43  mfu: 34.22%
[rank0]:[titan] 2025-06-25 13:16:01,831 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:16:01,833 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:16:01,853 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:16:02,368 - root - INFO - Dumping profiler traces at step 600
[rank0]:[titan] 2025-06-25 13:16:02,566 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:16:03,466 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-600
[rank0]:[titan] 2025-06-25 13:16:17,066 - root - INFO - step: 610  loss:  5.0695  memory: 49.66GiB(52.28%)  tps: 5,378  tflops: 311.46  mfu: 31.49%
[rank0]:[titan] 2025-06-25 13:16:31,050 - root - INFO - step: 620  loss:  5.0808  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.31  mfu: 34.31%
[rank0]:[titan] 2025-06-25 13:16:45,041 - root - INFO - step: 630  loss:  4.9225  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.14  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:16:59,065 - root - INFO - step: 640  loss:  4.8939  memory: 49.66GiB(52.28%)  tps: 5,842  tflops: 338.36  mfu: 34.21%
[rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-600', uuid='05657add-86bb-4707-b3d6-3494d586c34d')
[rank0]:[0] 2025-06-25 13:17:05,017 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:17:11,847 - root - INFO - [GC] Peforming periodical GC collection. 0.23 seconds.
[rank0]:[titan] 2025-06-25 13:17:13,311 - root - INFO - step: 650  loss:  5.0611  memory: 49.66GiB(52.28%)  tps: 5,751  tflops: 333.05  mfu: 33.68%
[rank0]:[titan] 2025-06-25 13:17:27,609 - root - INFO - step: 660  loss:  4.9348  memory: 49.66GiB(52.28%)  tps: 5,731  tflops: 331.88  mfu: 33.56%
[rank0]:[titan] 2025-06-25 13:17:41,811 - root - INFO - step: 670  loss:  4.8798  memory: 49.66GiB(52.28%)  tps: 5,769  tflops: 334.13  mfu: 33.78%
[rank0]:[titan] 2025-06-25 13:17:56,001 - root - INFO - step: 680  loss:  4.9749  memory: 49.66GiB(52.28%)  tps: 5,774  tflops: 334.40  mfu: 33.81%
[rank0]:[titan] 2025-06-25 13:18:10,004 - root - INFO - step: 690  loss:  4.7231  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.86  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:18:22,447 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-25 13:18:24,013 - root - INFO - step: 700  loss:  4.8765  memory: 49.66GiB(52.28%)  tps: 5,848  tflops: 338.69  mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:18:24,014 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:18:24,015 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:18:24,034 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:18:24,557 - root - INFO - Dumping profiler traces at step 700
[rank0]:[titan] 2025-06-25 13:18:24,762 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:18:25,616 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-700
[rank0]:[titan] 2025-06-25 13:18:39,233 - root - INFO - step: 710  loss:  4.8973  memory: 49.66GiB(52.28%)  tps: 5,383  tflops: 311.76  mfu: 31.52%
[rank0]:[titan] 2025-06-25 13:18:53,245 - root - INFO - step: 720  loss:  4.8671  memory: 49.66GiB(52.28%)  tps: 5,848  tflops: 338.67  mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:19:07,252 - root - INFO - step: 730  loss:  5.1143  memory: 49.66GiB(52.28%)  tps: 5,849  tflops: 338.76  mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:19:21,270 - root - INFO - step: 740  loss:  4.9568  memory: 49.66GiB(52.28%)  tps: 5,845  tflops: 338.51  mfu: 34.23%
[rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-700', uuid='40580106-cb24-4788-a99d-b10ff483e779')
[rank0]:[0] 2025-06-25 13:19:28,527 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:19:33,755 - root - INFO - [GC] Peforming periodical GC collection. 0.11 seconds.
[rank0]:[titan] 2025-06-25 13:19:35,288 - root - INFO - step: 750  loss:  4.6943  memory: 49.66GiB(52.28%)  tps: 5,845  tflops: 338.49  mfu: 34.23%
[rank0]:[titan] 2025-06-25 13:19:49,292 - root - INFO - step: 760  loss:  4.7441  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.85  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:20:03,668 - root - INFO - step: 770  loss:  4.7597  memory: 49.66GiB(52.28%)  tps: 5,699  tflops: 330.05  mfu: 33.37%
[rank0]:[titan] 2025-06-25 13:20:18,428 - root - INFO - step: 780  loss:  4.7751  memory: 49.66GiB(52.28%)  tps: 5,551  tflops: 321.47  mfu: 32.50%
[rank0]:[titan] 2025-06-25 13:20:32,447 - root - INFO - step: 790  loss:  4.7187  memory: 49.66GiB(52.28%)  tps: 5,845  tflops: 338.49  mfu: 34.23%
[rank0]:[titan] 2025-06-25 13:20:44,878 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - step: 800  loss:  4.7215  memory: 49.66GiB(52.28%)  tps: 5,847  tflops: 338.65  mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:20:46,458 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:20:46,459 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:20:46,479 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:20:47,058 - root - INFO - Dumping profiler traces at step 800
[rank0]:[titan] 2025-06-25 13:20:47,260 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:20:48,131 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-800
[rank0]:[titan] 2025-06-25 13:21:01,733 - root - INFO - step: 810  loss:  4.7147  memory: 49.66GiB(52.28%)  tps: 5,364  tflops: 310.63  mfu: 31.41%
[rank0]:[titan] 2025-06-25 13:21:15,731 - root - INFO - step: 820  loss:  4.6326  memory: 49.66GiB(52.28%)  tps: 5,853  tflops: 338.97  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:21:29,735 - root - INFO - step: 830  loss:  4.7056  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.85  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:21:43,727 - root - INFO - step: 840  loss:  4.6464  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.13  mfu: 34.29%
[rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-800', uuid='a5a2aa11-d40c-4d6e-8f58-a21181f08eaf')
[rank0]:[0] 2025-06-25 13:21:53,837 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:21:56,225 - root - INFO - [GC] Peforming periodical GC collection. 0.13 seconds.
[rank0]:[titan] 2025-06-25 13:21:57,742 - root - INFO - step: 850  loss:  4.8118  memory: 49.66GiB(52.28%)  tps: 5,846  tflops: 338.58  mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:22:11,742 - root - INFO - step: 860  loss:  4.6158  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.94  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:22:25,735 - root - INFO - step: 870  loss:  4.5595  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.12  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:22:39,726 - root - INFO - step: 880  loss:  4.4991  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.17  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:22:53,710 - root - INFO - step: 890  loss:  4.7063  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.34  mfu: 34.31%
[rank0]:[titan] 2025-06-25 13:23:06,134 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - step: 900  loss:  4.7203  memory: 49.66GiB(52.28%)  tps: 5,849  tflops: 338.74  mfu: 34.25%
[rank0]:[titan] 2025-06-25 13:23:07,717 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:23:07,718 - root - INFO - [GC] GC collection invoked by checkpointer. 0.00 seconds.
[rank0]:[titan] 2025-06-25 13:23:07,738 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:23:08,289 - root - INFO - Dumping profiler traces at step 900
[rank0]:[titan] 2025-06-25 13:23:08,491 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[0] 2025-06-25 13:23:09,410 - root - INFO - Received async checkpoint request with id=./outputs/checkpoint/step-900
[rank0]:[titan] 2025-06-25 13:23:22,968 - root - INFO - step: 910  loss:  4.4820  memory: 49.66GiB(52.28%)  tps: 5,372  tflops: 311.11  mfu: 31.46%
[rank0]:[titan] 2025-06-25 13:23:36,961 - root - INFO - step: 920  loss:  4.6184  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.10  mfu: 34.29%
[rank0]:[titan] 2025-06-25 13:23:50,963 - root - INFO - step: 930  loss:  4.6420  memory: 49.66GiB(52.28%)  tps: 5,852  tflops: 338.90  mfu: 34.27%
[rank0]:[titan] 2025-06-25 13:24:04,987 - root - INFO - step: 940  loss:  4.5612  memory: 49.66GiB(52.28%)  tps: 5,842  tflops: 338.35  mfu: 34.21%
[rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Submitted checkpoint save request for checkpoint_id=_CheckpointRequestIdentifier(checkpoint_id='./outputs/checkpoint/step-900', uuid='13493b71-2fef-4543-94dc-65dd4c30d299')
[rank0]:[0] 2025-06-25 13:24:13,578 - root - INFO - Waiting for checkpoint save request...
[rank0]:[titan] 2025-06-25 13:24:17,438 - root - INFO - [GC] Peforming periodical GC collection. 0.09 seconds.
[rank0]:[titan] 2025-06-25 13:24:18,991 - root - INFO - step: 950  loss:  4.6412  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.84  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:24:33,004 - root - INFO - step: 960  loss:  4.5748  memory: 49.66GiB(52.28%)  tps: 5,847  tflops: 338.60  mfu: 34.24%
[rank0]:[titan] 2025-06-25 13:24:47,007 - root - INFO - step: 970  loss:  4.6774  memory: 49.66GiB(52.28%)  tps: 5,851  tflops: 338.85  mfu: 34.26%
[rank0]:[titan] 2025-06-25 13:25:01,005 - root - INFO - step: 980  loss:  4.5435  memory: 49.66GiB(52.28%)  tps: 5,853  tflops: 338.99  mfu: 34.28%
[rank0]:[titan] 2025-06-25 13:25:15,028 - root - INFO - step: 990  loss:  4.6884  memory: 49.66GiB(52.28%)  tps: 5,842  tflops: 338.36  mfu: 34.21%
[rank0]:[titan] 2025-06-25 13:25:27,494 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - step: 1000  loss:  4.6401  memory: 49.66GiB(52.28%)  tps: 5,828  tflops: 337.50  mfu: 34.13%
[rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving the checkpoint (or staging if async is enabled).
[rank0]:[titan] 2025-06-25 13:25:29,087 - root - INFO - Saving a full checkpoint at last step, step 1000.
[rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - [GC] GC collection invoked by checkpointer. 0.02 seconds.
[rank0]:[titan] 2025-06-25 13:26:18,295 - root - INFO - Finished saving the checkpoint (or staging if async is enabled)in 49.21 seconds.
[rank0]:[titan] 2025-06-25 13:26:18,296 - root - INFO - Checkpointer is deleting ./outputs/checkpoint/step-1.
[rank0]:[titan] 2025-06-25 13:26:19,208 - root - INFO - Dumping profiler traces at step 1000
[rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[titan] 2025-06-25 13:26:19,433 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-25 13:26:21,065 - root - INFO - Checkpointer deleted ./outputs/checkpoint/step-1 in 2.77 seconds.
[rank0]:[titan] 2025-06-25 13:26:21,433 - root - INFO - Training completed
[rank0]:[titan] 2025-06-25 13:26:21,434 - root - INFO - Destroying the purge thread.
[rank0]:[titan] 2025-06-25 13:26:28,980 - root - INFO - Process group destroyed.
```

### TorchTitan Training (Async+ Pinned Mem) Without DCP for 1000 steps
on LLAMA3 8B Model
```
[rank0]:[titan] 2025-06-16 22:26:05,616 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:[titan] 2025-06-16 22:26:05,617 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-16 22:26:05,680 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-16 22:26:05,942 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-06-16 22:26:05,943 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Trainer is initialized with local batch size 1, global batch size 8, gradient accumulation steps 1, sequence length 8192, total steps 1000 (warmup 200).
[rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-16 22:26:05,944 - root - INFO - Profiling active. Traces will be saved at ./outputs/profile_trace
[rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - step:  1  loss: 12.2552  memory: 42.16GiB(44.38%)  tps: 1,512  tflops: 87.59  mfu: 8.86%
[rank0]:[titan] 2025-06-16 22:26:11,033 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-16 22:26:24,113 - root - INFO - step: 10  loss: 10.2261  memory: 49.66GiB(52.28%)  tps: 5,638  tflops: 326.50  mfu: 33.01%
[rank0]:[titan] 2025-06-16 22:26:37,933 - root - INFO - step: 20  loss:  8.5252  memory: 49.66GiB(52.28%)  tps: 5,928  tflops: 343.32  mfu: 34.71%
[rank0]:[titan] 2025-06-16 22:26:51,817 - root - INFO - step: 30  loss:  7.7284  memory: 49.66GiB(52.28%)  tps: 5,902  tflops: 341.78  mfu: 34.56%
[rank0]:[titan] 2025-06-16 22:27:05,726 - root - INFO - step: 40  loss:  7.3651  memory: 49.66GiB(52.28%)  tps: 5,890  tflops: 341.14  mfu: 34.49%
[rank0]:[titan] 2025-06-16 22:27:18,093 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-16 22:27:19,646 - root - INFO - step: 50  loss:  7.1575  memory: 49.66GiB(52.28%)  tps: 5,886  tflops: 340.87  mfu: 34.47%
[rank0]:[titan] 2025-06-16 22:27:33,572 - root - INFO - step: 60  loss:  6.8193  memory: 49.66GiB(52.28%)  tps: 5,883  tflops: 340.72  mfu: 34.45%
[rank0]:[titan] 2025-06-16 22:27:47,509 - root - INFO - step: 70  loss:  6.9294  memory: 49.66GiB(52.28%)  tps: 5,879  tflops: 340.47  mfu: 34.43%
[rank0]:[titan] 2025-06-16 22:28:01,442 - root - INFO - step: 80  loss:  6.6544  memory: 49.66GiB(52.28%)  tps: 5,880  tflops: 340.56  mfu: 34.44%
[rank0]:[titan] 2025-06-16 22:28:15,387 - root - INFO - step: 90  loss:  6.6736  memory: 49.66GiB(52.28%)  tps: 5,875  tflops: 340.27  mfu: 34.41%
[rank0]:[titan] 2025-06-16 22:28:28,168 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds.
[rank0]:[titan] 2025-06-16 22:28:29,755 - root - INFO - step: 100  loss:  6.4444  memory: 49.66GiB(52.28%)  tps: 5,702  tflops: 330.24  mfu: 33.39%
[rank0]:[titan] 2025-06-16 22:28:30,372 - root - INFO - Dumping profiler traces at step 100
[rank0]:[titan] 2025-06-16 22:28:30,573 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:28:44,564 - root - INFO - step: 110  loss:  6.6835  memory: 49.66GiB(52.28%)  tps: 5,532  tflops: 320.39  mfu: 32.40%
[rank0]:[titan] 2025-06-16 22:28:58,531 - root - INFO - step: 120  loss:  6.3605  memory: 49.66GiB(52.28%)  tps: 5,866  tflops: 339.72  mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:29:12,500 - root - INFO - step: 130  loss:  6.4930  memory: 49.66GiB(52.28%)  tps: 5,865  tflops: 339.69  mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:29:26,476 - root - INFO - step: 140  loss:  6.4257  memory: 49.66GiB(52.28%)  tps: 5,863  tflops: 339.53  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:29:38,860 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds.
[rank0]:[titan] 2025-06-16 22:29:40,448 - root - INFO - step: 150  loss:  6.0684  memory: 49.66GiB(52.28%)  tps: 5,864  tflops: 339.61  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:29:54,415 - root - INFO - step: 160  loss:  6.3202  memory: 49.66GiB(52.28%)  tps: 5,866  tflops: 339.72  mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:30:08,386 - root - INFO - step: 170  loss:  5.9970  memory: 49.66GiB(52.28%)  tps: 5,864  tflops: 339.64  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:30:22,357 - root - INFO - step: 180  loss:  6.0888  memory: 49.66GiB(52.28%)  tps: 5,865  tflops: 339.65  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:30:36,333 - root - INFO - step: 190  loss:  6.7146  memory: 49.66GiB(52.28%)  tps: 5,862  tflops: 339.52  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:30:48,735 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:30:50,322 - root - INFO - step: 200  loss:  6.1384  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.16  mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:30:50,758 - root - INFO - Dumping profiler traces at step 200
[rank0]:[titan] 2025-06-16 22:30:50,976 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[titan] 2025-06-16 22:31:04,970 - root - INFO - step: 210  loss:  6.0388  memory: 49.66GiB(52.28%)  tps: 5,593  tflops: 323.92  mfu: 32.75%
[rank0]:[titan] 2025-06-16 22:31:18,934 - root - INFO - step: 220  loss:  6.0348  memory: 49.66GiB(52.28%)  tps: 5,868  tflops: 339.82  mfu: 34.36%
[rank0]:[titan] 2025-06-16 22:31:32,895 - root - INFO - step: 230  loss:  5.9670  memory: 49.66GiB(52.28%)  tps: 5,869  tflops: 339.88  mfu: 34.37%
[rank0]:[titan] 2025-06-16 22:31:46,869 - root - INFO - step: 240  loss:  5.8311  memory: 49.66GiB(52.28%)  tps: 5,864  tflops: 339.59  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:31:59,261 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:32:00,846 - root - INFO - step: 250  loss:  5.8383  memory: 49.66GiB(52.28%)  tps: 5,862  tflops: 339.48  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:32:15,834 - root - INFO - step: 260  loss:  5.7541  memory: 49.66GiB(52.28%)  tps: 5,467  tflops: 316.60  mfu: 32.01%
[rank0]:[titan] 2025-06-16 22:32:29,803 - root - INFO - step: 270  loss:  5.7202  memory: 49.66GiB(52.28%)  tps: 5,865  tflops: 339.66  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:32:43,772 - root - INFO - step: 280  loss:  5.6326  memory: 49.66GiB(52.28%)  tps: 5,866  tflops: 339.71  mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:32:57,734 - root - INFO - step: 290  loss:  5.7467  memory: 49.66GiB(52.28%)  tps: 5,868  tflops: 339.85  mfu: 34.36%
[rank0]:[titan] 2025-06-16 22:33:10,148 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-16 22:33:11,718 - root - INFO - step: 300  loss:  5.7704  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.31  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:33:12,173 - root - INFO - Dumping profiler traces at step 300
[rank0]:[titan] 2025-06-16 22:33:12,371 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:33:26,359 - root - INFO - step: 310  loss:  5.6014  memory: 49.66GiB(52.28%)  tps: 5,596  tflops: 324.09  mfu: 32.77%
[rank0]:[titan] 2025-06-16 22:33:40,337 - root - INFO - step: 320  loss:  5.5262  memory: 49.66GiB(52.28%)  tps: 5,862  tflops: 339.47  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:33:54,315 - root - INFO - step: 330  loss:  5.4930  memory: 49.66GiB(52.28%)  tps: 5,861  tflops: 339.45  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:34:08,294 - root - INFO - step: 340  loss:  5.5682  memory: 49.66GiB(52.28%)  tps: 5,861  tflops: 339.44  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:34:20,693 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:34:22,279 - root - INFO - step: 350  loss:  5.5599  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.32  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:34:36,250 - root - INFO - step: 360  loss:  5.5070  memory: 49.66GiB(52.28%)  tps: 5,864  tflops: 339.63  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:34:50,242 - root - INFO - step: 370  loss:  5.4143  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.11  mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:35:04,230 - root - INFO - step: 380  loss:  5.4226  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.23  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:35:18,206 - root - INFO - step: 390  loss:  5.3208  memory: 49.66GiB(52.28%)  tps: 5,863  tflops: 339.55  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:35:30,661 - root - INFO - [GC] Peforming periodical GC collection. 0.05 seconds.
[rank0]:[titan] 2025-06-16 22:35:32,251 - root - INFO - step: 400  loss:  5.4494  memory: 49.66GiB(52.28%)  tps: 5,833  tflops: 337.83  mfu: 34.16%
[rank0]:[titan] 2025-06-16 22:35:32,714 - root - INFO - Dumping profiler traces at step 400
[rank0]:[titan] 2025-06-16 22:35:32,916 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:35:46,913 - root - INFO - step: 410  loss:  5.2551  memory: 49.66GiB(52.28%)  tps: 5,588  tflops: 323.63  mfu: 32.72%
[rank0]:[titan] 2025-06-16 22:36:00,884 - root - INFO - step: 420  loss:  5.4090  memory: 49.66GiB(52.28%)  tps: 5,865  tflops: 339.64  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:36:14,870 - root - INFO - step: 430  loss:  5.3344  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.27  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:36:28,864 - root - INFO - step: 440  loss:  5.3106  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.10  mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:36:41,255 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:36:42,843 - root - INFO - step: 450  loss:  5.2298  memory: 49.66GiB(52.28%)  tps: 5,862  tflops: 339.47  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:36:56,829 - root - INFO - step: 460  loss:  5.2365  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.26  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:37:10,821 - root - INFO - step: 470  loss:  5.1352  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.15  mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:37:24,810 - root - INFO - step: 480  loss:  5.1063  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.20  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:37:38,791 - root - INFO - step: 490  loss:  5.1559  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.40  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:37:51,229 - root - INFO - [GC] Peforming periodical GC collection. 0.08 seconds.
[rank0]:[titan] 2025-06-16 22:37:52,788 - root - INFO - step: 500  loss:  5.2053  memory: 49.66GiB(52.28%)  tps: 5,853  tflops: 339.00  mfu: 34.28%
[rank0]:[titan] 2025-06-16 22:37:53,282 - root - INFO - Dumping profiler traces at step 500
[rank0]:[titan] 2025-06-16 22:37:53,479 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:38:07,509 - root - INFO - step: 510  loss:  5.0252  memory: 49.66GiB(52.28%)  tps: 5,565  tflops: 322.31  mfu: 32.59%
[rank0]:[titan] 2025-06-16 22:38:22,243 - root - INFO - step: 520  loss:  5.2369  memory: 49.66GiB(52.28%)  tps: 5,561  tflops: 322.06  mfu: 32.56%
[rank0]:[titan] 2025-06-16 22:38:36,217 - root - INFO - step: 530  loss:  5.1380  memory: 49.66GiB(52.28%)  tps: 5,863  tflops: 339.56  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:38:50,196 - root - INFO - step: 540  loss:  5.0538  memory: 49.66GiB(52.28%)  tps: 5,861  tflops: 339.44  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:39:02,608 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:39:04,186 - root - INFO - step: 550  loss:  5.0703  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.16  mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:39:18,173 - root - INFO - step: 560  loss:  5.1104  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.27  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:39:32,166 - root - INFO - step: 570  loss:  4.9368  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.12  mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:39:46,148 - root - INFO - step: 580  loss:  5.0037  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.38  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:40:00,139 - root - INFO - step: 590  loss:  5.0303  memory: 49.66GiB(52.28%)  tps: 5,856  tflops: 339.16  mfu: 34.29%
[rank0]:[titan] 2025-06-16 22:40:12,566 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:40:14,145 - root - INFO - step: 600  loss:  4.9233  memory: 49.66GiB(52.28%)  tps: 5,850  tflops: 338.78  mfu: 34.25%
[rank0]:[titan] 2025-06-16 22:40:14,671 - root - INFO - Dumping profiler traces at step 600
[rank0]:[titan] 2025-06-16 22:40:14,871 - root - INFO - Finished dumping profiler traces in 0.20 seconds
[rank0]:[titan] 2025-06-16 22:40:28,831 - root - INFO - step: 610  loss:  5.0338  memory: 49.66GiB(52.28%)  tps: 5,579  tflops: 323.10  mfu: 32.67%
[rank0]:[titan] 2025-06-16 22:40:42,800 - root - INFO - step: 620  loss:  5.0437  memory: 49.66GiB(52.28%)  tps: 5,865  tflops: 339.68  mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:40:56,779 - root - INFO - step: 630  loss:  4.8868  memory: 49.66GiB(52.28%)  tps: 5,861  tflops: 339.46  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:41:10,768 - root - INFO - step: 640  loss:  4.8542  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.19  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:41:23,176 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:41:24,756 - root - INFO - step: 650  loss:  5.0227  memory: 49.66GiB(52.28%)  tps: 5,857  tflops: 339.23  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:41:38,741 - root - INFO - step: 660  loss:  4.8901  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.32  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:41:52,723 - root - INFO - step: 670  loss:  4.8374  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.36  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:42:06,707 - root - INFO - step: 680  loss:  4.9357  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.33  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:42:20,691 - root - INFO - step: 690  loss:  4.6904  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.33  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:42:33,106 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:42:34,686 - root - INFO - step: 700  loss:  4.8426  memory: 49.66GiB(52.28%)  tps: 5,854  tflops: 339.02  mfu: 34.28%
[rank0]:[titan] 2025-06-16 22:42:35,202 - root - INFO - Dumping profiler traces at step 700
[rank0]:[titan] 2025-06-16 22:42:35,411 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[titan] 2025-06-16 22:42:49,414 - root - INFO - step: 710  loss:  4.8630  memory: 49.66GiB(52.28%)  tps: 5,563  tflops: 322.17  mfu: 32.58%
[rank0]:[titan] 2025-06-16 22:43:03,389 - root - INFO - step: 720  loss:  4.8307  memory: 49.66GiB(52.28%)  tps: 5,863  tflops: 339.54  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:43:17,366 - root - INFO - step: 730  loss:  5.0704  memory: 49.66GiB(52.28%)  tps: 5,862  tflops: 339.48  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:43:31,346 - root - INFO - step: 740  loss:  4.9245  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.40  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:43:43,762 - root - INFO - [GC] Peforming periodical GC collection. 0.07 seconds.
[rank0]:[titan] 2025-06-16 22:43:45,331 - root - INFO - step: 750  loss:  4.6614  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.31  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:43:59,308 - root - INFO - step: 760  loss:  4.7145  memory: 49.66GiB(52.28%)  tps: 5,862  tflops: 339.50  mfu: 34.33%
[rank0]:[titan] 2025-06-16 22:44:13,608 - root - INFO - step: 770  loss:  4.7226  memory: 49.66GiB(52.28%)  tps: 5,730  tflops: 331.83  mfu: 33.55%
[rank0]:[titan] 2025-06-16 22:44:28,628 - root - INFO - step: 780  loss:  4.7430  memory: 49.66GiB(52.28%)  tps: 5,455  tflops: 315.93  mfu: 31.94%
[rank0]:[titan] 2025-06-16 22:44:42,611 - root - INFO - step: 790  loss:  4.6875  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.35  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:44:55,024 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:44:56,605 - root - INFO - step: 800  loss:  4.6901  memory: 49.66GiB(52.28%)  tps: 5,855  tflops: 339.06  mfu: 34.28%
[rank0]:[titan] 2025-06-16 22:44:57,138 - root - INFO - Dumping profiler traces at step 800
[rank0]:[titan] 2025-06-16 22:44:57,344 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[titan] 2025-06-16 22:45:11,360 - root - INFO - step: 810  loss:  4.6838  memory: 49.66GiB(52.28%)  tps: 5,552  tflops: 321.56  mfu: 32.51%
[rank0]:[titan] 2025-06-16 22:45:25,305 - root - INFO - step: 820  loss:  4.6000  memory: 49.66GiB(52.28%)  tps: 5,876  tflops: 340.28  mfu: 34.41%
[rank0]:[titan] 2025-06-16 22:45:39,251 - root - INFO - step: 830  loss:  4.6733  memory: 49.66GiB(52.28%)  tps: 5,875  tflops: 340.25  mfu: 34.40%
[rank0]:[titan] 2025-06-16 22:45:53,208 - root - INFO - step: 840  loss:  4.6178  memory: 49.66GiB(52.28%)  tps: 5,870  tflops: 339.98  mfu: 34.38%
[rank0]:[titan] 2025-06-16 22:46:05,611 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:46:07,194 - root - INFO - step: 850  loss:  4.7753  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.27  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:46:21,172 - root - INFO - step: 860  loss:  4.5818  memory: 49.66GiB(52.28%)  tps: 5,861  tflops: 339.45  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:46:35,159 - root - INFO - step: 870  loss:  4.5332  memory: 49.66GiB(52.28%)  tps: 5,858  tflops: 339.25  mfu: 34.30%
[rank0]:[titan] 2025-06-16 22:46:49,143 - root - INFO - step: 880  loss:  4.4733  memory: 49.66GiB(52.28%)  tps: 5,859  tflops: 339.34  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:47:03,125 - root - INFO - step: 890  loss:  4.6788  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.38  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:47:15,614 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:47:17,195 - root - INFO - step: 900  loss:  4.6943  memory: 49.66GiB(52.28%)  tps: 5,823  tflops: 337.23  mfu: 34.10%
[rank0]:[titan] 2025-06-16 22:47:17,789 - root - INFO - Dumping profiler traces at step 900
[rank0]:[titan] 2025-06-16 22:47:18,004 - root - INFO - Finished dumping profiler traces in 0.22 seconds
[rank0]:[titan] 2025-06-16 22:47:31,972 - root - INFO - step: 910  loss:  4.4559  memory: 49.66GiB(52.28%)  tps: 5,544  tflops: 321.10  mfu: 32.47%
[rank0]:[titan] 2025-06-16 22:47:45,950 - root - INFO - step: 920  loss:  4.5920  memory: 49.66GiB(52.28%)  tps: 5,861  tflops: 339.46  mfu: 34.32%
[rank0]:[titan] 2025-06-16 22:47:59,923 - root - INFO - step: 930  loss:  4.6219  memory: 49.66GiB(52.28%)  tps: 5,864  tflops: 339.58  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:48:13,889 - root - INFO - step: 940  loss:  4.5429  memory: 49.66GiB(52.28%)  tps: 5,867  tflops: 339.76  mfu: 34.35%
[rank0]:[titan] 2025-06-16 22:48:26,271 - root - INFO - [GC] Peforming periodical GC collection. 0.06 seconds.
[rank0]:[titan] 2025-06-16 22:48:27,849 - root - INFO - step: 950  loss:  4.6160  memory: 49.66GiB(52.28%)  tps: 5,870  tflops: 339.93  mfu: 34.37%
[rank0]:[titan] 2025-06-16 22:48:41,809 - root - INFO - step: 960  loss:  4.5495  memory: 49.66GiB(52.28%)  tps: 5,869  tflops: 339.91  mfu: 34.37%
[rank0]:[titan] 2025-06-16 22:48:55,764 - root - INFO - step: 970  loss:  4.6507  memory: 49.66GiB(52.28%)  tps: 5,871  tflops: 340.03  mfu: 34.38%
[rank0]:[titan] 2025-06-16 22:49:09,734 - root - INFO - step: 980  loss:  4.5222  memory: 49.66GiB(52.28%)  tps: 5,865  tflops: 339.66  mfu: 34.34%
[rank0]:[titan] 2025-06-16 22:49:23,717 - root - INFO - step: 990  loss:  4.6672  memory: 49.66GiB(52.28%)  tps: 5,860  tflops: 339.37  mfu: 34.31%
[rank0]:[titan] 2025-06-16 22:49:36,204 - root - INFO - [GC] Peforming periodical GC collection. 0.10 seconds.
[rank0]:[titan] 2025-06-16 22:49:37,745 - root - INFO - step: 1000  loss:  4.6165  memory: 49.66GiB(52.28%)  tps: 5,840  tflops: 338.24  mfu: 34.20%
[rank0]:[titan] 2025-06-16 22:49:38,344 - root - INFO - Dumping profiler traces at step 1000
[rank0]:[titan] 2025-06-16 22:49:38,552 - root - INFO - Finished dumping profiler traces in 0.21 seconds
[rank0]:[titan] 2025-06-16 22:49:38,553 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-16 22:49:40,553 - root - INFO - Training completed
[rank0]:[titan] 2025-06-16 22:49:41,030 - root - INFO - Process group destroyed.
```

---------

Co-authored-by: saiteja64 &lt;saiteja64@devvm8023.cco0.facebook.com&gt;
diff --git a/tests/unit_tests/test_checkpoint.py b/tests/unit_tests/test_checkpoint.py
@@ -420,15 +420,15 @@ def test_async_save_calls_async_wait(self, mock_async_save, mock_new_group):
 
         # First save schedules async
         manager.save(curr_step=10, last_step=False)
-        future = manager.async_future
+        future = manager.save_future
         future.result.assert_not_called()
 
         # Second save should wait
         manager.save(curr_step=20, last_step=False)
         future.result.assert_called_once()
 
         # New future created
-        new_future = manager.async_future
+        new_future = manager.save_future
         new_future.result.assert_not_called()
 
     @mock.patch("torch.cuda.Stream")
@@ -446,8 +446,10 @@ def test_ft_async_save_calls_async_wait(
         Test that with FT enabled, AsyncMode.ASYNC via FT triggers correct waits.
         """
         job_config = DummyJobConfig(job=self.job_config.job)
-        job_config.checkpoint.async_mode = "disabled"
+        job_config.checkpoint.async_mode = "async"
         ft_manager = mock.Mock()
+        ft_manager.manager.return_value = mock.Mock()
+        ft_manager.manager.participating_rank = mock.Mock(return_value=0)
         ft_manager.enabled = True
         states = {"trainer": torch.tensor([0])}
         manager = CheckpointManager(
@@ -461,16 +463,16 @@ def test_ft_async_save_calls_async_wait(
         )
 
         # Initially no future
-        self.assertIsNone(manager.async_future)
+        self.assertIsNone(manager.save_future)
         manager.save(curr_step=5, last_step=False)
-        self.assertIsNotNone(manager.async_future)
+        self.assertIsNotNone(manager.save_future)
 
-        manager.async_future.result.assert_not_called()
-        prev_future = manager.async_future
+        manager.save_future.result.assert_not_called()
+        prev_future = manager.save_future
         manager.save(curr_step=6, last_step=False)
         prev_future.result.assert_called_once()
-        self.assertIsNotNone(manager.async_future)
-        manager.async_future.result.assert_not_called()
+        self.assertIsNotNone(manager.save_future)
+        manager.save_future.result.assert_not_called()
 
     @mock.patch("torch.distributed.get_rank", return_value=0)
     @mock.patch("torchtitan.components.checkpoint.dcp.save")
diff --git a/torchtitan/components/checkpoint.py b/torchtitan/components/checkpoint.py
@@ -17,22 +17,22 @@
 import torch
 import torch.distributed as dist
 import torch.distributed.checkpoint as dcp
-import torch.multiprocessing as mp
 import torch.nn as nn
-from torch.distributed._state_dict_utils import _copy_state_dict, _create_cpu_state_dict
+from torch.distributed.checkpoint.staging import DefaultStager, StagingOptions
 from torch.distributed.checkpoint.state_dict import (
     get_model_state_dict,
     set_model_state_dict,
     StateDictOptions,
 )
+from torch.distributed.checkpoint.state_dict_saver import AsyncCheckpointerType
 from torch.distributed.checkpoint.stateful import Stateful
 from torch.utils.data import DataLoader
 
 from torchtitan.components.ft import FTManager
 from torchtitan.components.lr_scheduler import LRSchedulersContainer
 from torchtitan.components.optimizer import OptimizersContainer
 from torchtitan.config_manager import JobConfig, TORCH_DTYPE_MAP
-from torchtitan.tools.logging import init_logger, logger
+from torchtitan.tools.logging import logger
 from torchtitan.tools.utils import GarbageCollection
 
 
@@ -98,43 +98,6 @@ def save_with_gc(state, checkpoint_id):
     GarbageCollection.collect("GC collection invoked by checkpointer.")
 
 
-def checkpoint_mp(recv: mp.Queue, send: mp.Queue):
-    """Process to save the checkpoint in the background.
-
-    This is only used when async_checkpoint_with_pinned_memory is enabled.
-
-    Args:
-        recv (mp.Queue): The queue to receive the state_dict and Terminate signal.
-        send (mp.Queue): The queue to send the SaveDone signal.
-    """
-    init_logger()
-    os.environ["MASTER_PORT"] = str(int(os.environ["MASTER_PORT"]) + 2)
-    os.environ["TORCHELASTIC_USE_AGENT_STORE"] = "False"
-    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
-    dist.init_process_group()
-    try:
-        while True:
-            logger.debug("Checkpoint background process is done.")
-            send.put(SaveDone())
-            logger.debug("Wait for the new state_dict.")
-            obj = recv.get()
-            logger.debug("Received the new state_dict.")
-            if isinstance(obj, Terminate):
-                logger.info("Terminating the checkpoint background process.")
-                return
-            assert isinstance(obj, tuple)
-            begin = time.monotonic()
-            state, checkpoint_id = obj
-            save_with_gc(state, checkpoint_id=checkpoint_id)
-            logger.info(
-                "Finish saving the checkpoint in the background process in %.2f seconds.",
-                time.monotonic() - begin,
-            )
-    finally:
-        logger.info("Destroying the process group.")
-        dist.destroy_process_group()
-
-
 def purge_thread(purge_queue: queue.Queue):
     """Thread to purge the old checkpoints.
 
@@ -275,7 +238,7 @@ def load_state_dict(state_dict):
         self.sending_to_checkpoint_mp = False
         self.staging_id = None
         self.cpu_offload_state_dict = None
-        self.staging_stream = torch.cuda.Stream() if self.enable_staging else None
+        self.stager = None
 
         self.folder = os.path.join(job_config.job.dump_folder, ckpt_config.folder)
 
@@ -292,7 +255,11 @@ def load_state_dict(state_dict):
 
         # Async checkpoint related fields.
         async_mode = ckpt_config.async_mode.lower()
-        if async_mode == AsyncMode.ASYNC or self.ft_manager:
+        if (
+            async_mode == AsyncMode.ASYNC
+            or async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM
+            or self.ft_manager
+        ):
             self.pg = dist.new_group(backend="gloo")
 
         self.keep_latest_k = ckpt_config.keep_latest_k
@@ -311,25 +278,14 @@ def load_state_dict(state_dict):
             self.purge_thread = None
 
         self.mp = None
-        self.async_future = None
+        self.staging_future = None
+        self.save_future = None
         if async_mode == AsyncMode.DISABLED:
             self.async_mode = AsyncMode.DISABLED
         elif async_mode == AsyncMode.ASYNC:
             self.async_mode = AsyncMode.ASYNC
         elif async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
             self.async_mode = AsyncMode.ASYNC_WITH_PINNED_MEM
-            ctx = mp.get_context("spawn")
-            self.mp_queue_send = ctx.Queue()
-            self.mp_queue_recv = ctx.Queue()
-            self.mp = ctx.Process(
-                target=checkpoint_mp,
-                args=(
-                    self.mp_queue_send,
-                    self.mp_queue_recv,
-                ),
-                daemon=True,
-            )
-            self.mp.start()
         else:
             raise ValueError(f"Unkown checkpoint async_mode {ckpt_config.async_mode}")
 
@@ -353,6 +309,9 @@ def close(self):
                 self.purge_queue.put(Terminate())
                 self.purge_thread.join()
 
+            if self.stager is not None:
+                self.stager.close()
+
     @torch.no_grad()
     def save(self, curr_step: int, last_step: bool = False) -> None:
         """Save the checkpoint for the current step.
@@ -388,10 +347,20 @@ def save(self, curr_step: int, last_step: bool = False) -> None:
                 self._save_last_step(curr_step)
             elif self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
                 GarbageCollection.collect("GC collection invoked by checkpointer.")
-                self._async_with_pinned_memory(checkpoint_id)
+                if self.stager is None:
+                    self.stager = DefaultStager(StagingOptions(True, True, True, True))
+                result = dcp.async_save(
+                    self.states,
+                    checkpoint_id=checkpoint_id,
+                    process_group=self.pg,
+                    async_checkpointer_type=AsyncCheckpointerType.PROCESS,
+                    async_stager=self.stager,
+                )
+                self.save_future = result.upload_completion
+                self.staging_future = result.staging_completion
             elif self.async_mode == AsyncMode.ASYNC:
                 GarbageCollection.collect("GC collection invoked by checkpointer.")
-                self.async_future = dcp.async_save(
+                self.save_future = dcp.async_save(
                     self.states, checkpoint_id=checkpoint_id, process_group=self.pg
                 )
                 GarbageCollection.collect("GC collection invoked by checkpointer.")
@@ -475,33 +444,7 @@ def maybe_wait_for_staging(self) -> None:
         with ``async_checkpoint_with_pinned_memory``.
         """
         if self.enable_staging and self.staging:
-            if not self.staging_stream.query():
-                begin = time.monotonic()
-                self.staging_stream.synchronize()
-                logger.info(
-                    "Checkpointer waited staging %.2f seconds.",
-                    time.monotonic() - begin,
-                )
-            self.staging = False
-
-            if self.sending_to_checkpoint_mp:
-                # Copy the sync staging result to another process.
-                def sync_func():
-                    self.mp_queue_send.put_nowait(
-                        (self.cpu_offload_state_dict, self.staging_id)
-                    )
-
-                # This may be a faster way to do zero-overhead checkpointing staging
-                # checkpointing but we need more thorough investigation before
-                # swithing to this method.
-                # self.my_thread = threading.Thread(target=func).start()
-                begin = time.monotonic()
-                sync_func()
-                logger.info(
-                    "Checkpointer sent staged state_dict to another process %.2f seconds",
-                    time.monotonic() - begin,
-                )
-                self.sending_to_checkpoint_mp = False
+            self.staging_future.result()
 
     def _find_load_step(self, folder: str = "") -> int:
         """Find the step to load the checkpoint for.
@@ -540,7 +483,7 @@ def _ft_save(self, step: int) -> None:
         begin = time.monotonic()
         self._async_wait()
         checkpoint_id = self._create_checkpoint_id(step, folder=self._ft_folder())
-        self.async_future = dcp.async_save(
+        self.save_future = dcp.async_save(
             self.ft_states, checkpoint_id=checkpoint_id, process_group=self.pg
         )
         logger.info(f"Staging ft checkpoint took {time.monotonic() - begin} secs.")
@@ -633,45 +576,18 @@ def _should_save(self, curr_step: int, last_step: bool = False) -> bool:
 
     def _async_wait(self) -> None:
         if self.async_mode == AsyncMode.ASYNC_WITH_PINNED_MEM:
-            logger.debug(
-                f"Waiting for the background process to finish, {time.monotonic()=}.:.2f"
-            )
-            if not self.mp.is_alive():
-                raise RuntimeError("The checkpoint background process is dead.")
-            _ = self.mp_queue_recv.get()
+            if self.save_future is not None:
+                self.save_future.result()
         elif self.async_mode == AsyncMode.ASYNC or self.ft_manager is not None:
-            if self.async_future is not None:
-                self.async_future.result()
-                self.async_future = None
-        elif self.async_future is not None:
+            if self.save_future is not None:
+                self.save_future.result()
+                self.save_future = None
+        elif self.save_future is not None:
             raise RuntimeError(
-                "self.async_future is not None, but self.async_mode is not enabled "
+                "self.save_future is not None, but self.async_mode is not enabled "
                 "and fault tolerance is not active."
             )
 
-    def _async_with_pinned_memory(self, checkpoint_id: str) -> None:
-        self._cpu_staging(checkpoint_id)
-        self.sending_to_checkpoint_mp = True
-
-    def _cpu_staging(self, checkpoint_id: str | None) -> None:
-        """Offload state_dict to CPU memory"""
-        state_dict = dcp.state_dict_saver._stateful_to_state_dict(self.states)
-        if self.cpu_offload_state_dict is None:
-            logger.debug(f"Preparing the CPU memory, {time.monotonic()=}.:.2f")
-            self.cpu_offload_state_dict = _create_cpu_state_dict(
-                state_dict, pin_memory=True, share_memory=True
-            )
-
-        logger.debug(f"Staging the state_dict, {time.monotonic()=}.:.2f")
-        with torch.cuda.stream(self.staging_stream):
-            self.cpu_offload_state_dict = _copy_state_dict(
-                state_dict,
-                self.cpu_offload_state_dict,
-                non_blocking=True,
-            )
-            self.staging = True
-            self.staging_id = checkpoint_id
-
     def _purge_stale_checkpoints(self):
         if (
             self.keep_latest_k > 0