1 file changed: +14 -0 lines changed
@@ -588,6 +588,11 @@ class FaultTolerance:
     """
     Whether to quantize the gradients before allreduce.

+    Disabled by default because quantization consumes GPU compute
+    and issues additional collectives. Enable it only after weighing
+    the tradeoff between GPU utilization and communication cost.
+
+
     This is only used when "semi_sync_method" is set.
     """

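Review note: the tradeoff described in the docstring above can be illustrated with a minimal sketch. This is not the code path this option enables; `quantize_int8`, `dequantize_int8`, and `simulated_quantized_allreduce` are hypothetical names, the int8 scheme is an assumption, and the "allreduce" is simulated in plain Python rather than issued through a real collective.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to a scale of 1.0 for an all-zero (or empty) tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


def simulated_quantized_allreduce(
    grads_per_replica: List[List[float]],
) -> List[float]:
    """Average gradients across replicas using int8 codes plus scales.

    Hypothetical stand-in for a real collective. Each replica spends extra
    compute on quantization, and the per-replica scales travel alongside
    the codes, which is assumed here to be one source of extra traffic.
    """
    payloads = [quantize_int8(g) for g in grads_per_replica]
    restored = [dequantize_int8(codes, scale) for codes, scale in payloads]
    n = len(restored)
    return [sum(column) / n for column in zip(*restored)]
```

The averaged result is close to, but generally not identical to, the full-precision mean, so the bandwidth saved is paid for with quantization work and a small approximation error.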
@@ -597,13 +602,22 @@ class FaultTolerance:
     model fragment's synchronization. This is the "tau" parameter in
     the Streaming DiLoCo paper.

+    By default, each model fragment is synced at the same step
+    at which its allreduce is issued. Enabling a delay can improve
+    the overlap of communication and computation, but at the cost
+    of model quality.
+
     This is only used when "semi_sync_method" is set.
     """

     fragment_update_alpha: float = 0.0
     """
     Determines how to mix the local and global optimized parameters.

+    By default, only the global parameters are used. This ensures all
+    DDP replicas have the same parameters after synchronizing on
+    the fragment. Tuning this can also affect model quality.
+
     This is only used when "semi_sync_method" is set.
     """

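Review note: the two knobs documented in this hunk can be sketched as follows. Both helpers are hypothetical and written only for illustration. `sync_schedule` assumes a fixed sync interval (`sync_every` is an invented parameter), and `mix_parameters` assumes a linear interpolation in which `alpha` weights the local parameters, which matches the documented default of 0.0 meaning "global parameters only". The [0, 1] range check is likewise an assumption.

```python
from typing import List, Tuple


def sync_schedule(
    total_steps: int, sync_every: int, delay: int
) -> List[Tuple[int, int]]:
    """Return (issue_step, apply_step) pairs for one model fragment.

    delay=0 applies the result on the step the allreduce is issued.
    delay>0 lets that many local steps run while communication is in flight.
    """
    if sync_every <= 0 or delay < 0:
        raise ValueError("sync_every must be positive and delay non-negative")
    return [
        (step, step + delay)
        for step in range(sync_every, total_steps + 1, sync_every)
        if step + delay <= total_steps  # drop syncs that cannot finish in time
    ]


def mix_parameters(
    local: List[float], global_: List[float], alpha: float
) -> List[float]:
    """Blend local and global parameters: alpha * local + (1 - alpha) * global."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    if len(local) != len(global_):
        raise ValueError("local and global parameters must have the same length")
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local, global_)]


if __name__ == "__main__":
    print(sync_schedule(total_steps=10, sync_every=4, delay=0))
    print(sync_schedule(total_steps=10, sync_every=4, delay=3))
    print(mix_parameters([1.0, 2.0], [3.0, 4.0], alpha=0.0))
```

With `alpha=0.0` every replica ends up with exactly the global values, which is why the default keeps replicas identical after a fragment sync. Any nonzero `alpha` retains some replica-specific state.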