
Commit 1ab4353

document default parameters for streaming diloco (#1308)
Summary: document why default parameters are set the way they are for streaming diloco

Test Plan:

```
$ NGPU=2 ./run_train.sh --fault_tolerance.enable --fault_tolerance.group_size=1 --fault_tolerance.semi_sync_method=diloco --fault_tolerance.sync_steps=2 --fault_tolerance.replica_id=0 --fault_tolerance.fragment_sync_delay=1 --fault_tolerance.fragment_update_alpha=0.0
[rank0]:[titan] 2025-06-16 09:39:08,893 - root - INFO - Model llama3 debugmodel size: 6,270,208 total parameters
[rank0]:[titan] 2025-06-16 09:39:08,894 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:[titan] 2025-06-16 09:39:08,952 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-06-16 09:39:09,375 - root - WARNING - Peak flops undefined for: NVIDIA PG509-210, fallback to A100
[rank0]:[titan] 2025-06-16 09:39:09,376 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
[rank0]:[titan] 2025-06-16 09:39:09,376 - root - INFO - CUDA memory usage for model: 0.03GiB(0.04%)
[rank0]:[titan] 2025-06-16 09:39:09,377 - root - INFO - Trainer is initialized with local batch size 8, global batch size 16, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2).
[rank0]:[titan] 2025-06-16 09:39:09,377 - root - INFO - Training starts at step 1.
[rank0]:[titan] 2025-06-16 09:39:10,325 - root - INFO - step: 1 loss: 8.1934 memory: 1.26GiB(1.59%) tps: 11,442 tflops: 0.82 mfu: 0.26%
[rank0]:[titan] 2025-06-16 09:39:10,325 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-06-16 09:39:10,431 - root - INFO - step: 2 loss: 8.1507 memory: 1.35GiB(1.71%) tps: 154,916 tflops: 11.14 mfu: 3.57%
[rank0]:[titan] 2025-06-16 09:39:10,524 - root - INFO - step: 3 loss: 8.0737 memory: 1.35GiB(1.71%) tps: 177,405 tflops: 12.76 mfu: 4.09%
[rank0]:[titan] 2025-06-16 09:39:10,623 - root - INFO - step: 4 loss: 7.8865 memory: 1.35GiB(1.71%) tps: 167,289 tflops: 12.03 mfu: 3.86%
[rank0]:[titan] 2025-06-16 09:39:10,714 - root - INFO - step: 5 loss: 7.7620 memory: 1.35GiB(1.71%) tps: 179,656 tflops: 12.92 mfu: 4.14%
[rank0]:[titan] 2025-06-16 09:39:10,808 - root - INFO - step: 6 loss: 7.5449 memory: 1.35GiB(1.71%) tps: 175,901 tflops: 12.65 mfu: 4.05%
[rank0]:[titan] 2025-06-16 09:39:10,911 - root - INFO - step: 7 loss: 7.3452 memory: 1.35GiB(1.71%) tps: 159,859 tflops: 11.49 mfu: 3.68%
[rank0]:[titan] 2025-06-16 09:39:11,005 - root - INFO - step: 8 loss: 7.2973 memory: 1.35GiB(1.71%) tps: 175,980 tflops: 12.65 mfu: 4.06%
[rank0]:[titan] 2025-06-16 09:39:11,096 - root - INFO - step: 9 loss: 7.1333 memory: 1.35GiB(1.71%) tps: 179,903 tflops: 12.94 mfu: 4.15%
[rank0]:[titan] 2025-06-16 09:39:11,186 - root - INFO - step: 10 loss: 7.0747 memory: 1.35GiB(1.71%) tps: 184,628 tflops: 13.28 mfu: 4.26%
[rank0]:[titan] 2025-06-16 09:39:11,186 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:[titan] 2025-06-16 09:39:13,186 - root - INFO - Training completed
[rank0]:[titan] 2025-06-16 09:39:13,489 - root - INFO - Process group destroyed.
```
1 parent aae7323 commit 1ab4353

File tree

1 file changed: +14 -0 lines changed


torchtitan/config_manager.py

Lines changed: 14 additions & 0 deletions
```diff
@@ -588,6 +588,11 @@ class FaultTolerance:
     """
     Whether to quantize the gradients before allreduce.
 
+    Disabled by default since the quantization does utilize the GPU
+    and uses more collectives. Enabling this requires knowing about
+    the tradeoffs between GPU utilization and communication.
+
     This is only used when "semi_sync_method" is set.
     """
```
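To make the tradeoff above concrete, here is a minimal, purely illustrative sketch (not the collective torchft actually uses; it assumes an already-initialized NCCL process group): sending the gradient in a narrower dtype reduces the bytes on the wire, but the casts around the collective are extra GPU kernels, which is the GPU-utilization cost the docstring refers to.

```python
import torch
import torch.distributed as dist


def allreduce_grad_lowprec(grad: torch.Tensor) -> None:
    """Hypothetical helper: average a gradient across ranks in bfloat16.

    Fewer bytes cross the network, but the two casts below run as extra
    GPU kernels -- the utilization vs. communication tradeoff noted in
    the config docstring. Illustration only, not torchft's implementation.
    """
    buf = grad.to(torch.bfloat16)               # narrow the dtype before the collective
    dist.all_reduce(buf, op=dist.ReduceOp.AVG)  # average across all replicas
    grad.copy_(buf.to(grad.dtype))              # cast back into the original gradient
```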

```diff
@@ -597,13 +602,22 @@ class FaultTolerance:
     model fragment's synchronization. This is the "tau" parameter in
     the Streaming DiLoCo paper.
 
+    By default, each model fragment will be synced at the same step
+    at which the allreduce is issued. Enabling delay can improve
+    communication and computation overlap, but at the cost of compromising
+    model quality.
+
     This is only used when "semi_sync_method" is set.
     """
 
     fragment_update_alpha: float = 0.0
     """
     Determines how to mix the local and global optimized parameters.
 
+    By default, we just use the global parameters. This ensures all
+    DDP replicas have the same parameters after synchronizing on
+    the fragment. Tuning this can also affect the model quality.
+
     This is only used when "semi_sync_method" is set.
     """
```
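For fragment_update_alpha, the mixing described in the docstring can be sketched as a simple convex combination (an illustration of the general idea rather than the code torchft runs): with the default alpha = 0.0 every replica adopts the globally averaged fragment unchanged, so all DDP replicas hold identical parameters after the sync, while a larger alpha retains more of each replica's locally optimized parameters.

```python
import torch


def mix_fragment_params(
    local: torch.Tensor, global_avg: torch.Tensor, alpha: float = 0.0
) -> torch.Tensor:
    """Illustrative blend of local and globally averaged fragment parameters.

    alpha = 0.0 (the default) returns the global average unchanged, so every
    DDP replica ends up with identical parameters after the fragment sync.
    """
    return alpha * local + (1.0 - alpha) * global_avg
```

fragment_sync_delay controls when such an update lands: with the default of 0 the fragment is updated at the same step its allreduce is issued, while a positive delay lets subsequent training steps overlap with the in-flight communication, at some cost to model quality.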
