Summary:
## Types of changes
- [x] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Docs change / refactoring / dependency upgrade
## Motivation and Context / Related issue
This PR adds support for multi-device training scenarios where model parameters are distributed across multiple GPU devices (e.g., when assigning different layers directly with `module.to(device[i])` or using `device_map="auto"` with accelerate).
**Problem solved:**
When training large models that don't fit on a single GPU, parameters and gradients can be spread across multiple devices. The existing Opacus optimizers and gradient clipping modules assumed all tensors were on the same device, causing runtime errors during norm computation and gradient clipping operations.
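A minimal sketch of the kind of setup that triggered the issue (illustrative only; the module and layer names are hypothetical and assume two CUDA devices are available):

```python
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    """Toy model with layers placed on different GPUs."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32).to("cuda:0")  # first layer on GPU 0
        self.fc2 = nn.Linear(32, 4).to("cuda:1")   # second layer on GPU 1

    def forward(self, x):
        x = self.fc1(x.to("cuda:0"))
        return self.fc2(x.to("cuda:1"))  # move activations between devices
```

With such a model, per-parameter gradient tensors live on different devices, which previously broke operations that combine them (e.g., `torch.stack`).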
**Changes:**
1. **Sequential multi-device execution support (#9):** Modified `DPOptimizer` and `AdaClipDPOptimizer` to move tensors to appropriate devices before operations like `torch.stack()` and `torch.einsum()`, preventing device mismatch errors during gradient clipping and accumulation (see the sketch after this list).
2. **Multi-device support in GradSampleModuleFastGradientClipping (#10):** Extended multi-device handling to `GradSampleModuleFastGradientClipping`, `DPPerLayerOptimizer`, and additional edge cases in the optimizers that were not previously handled.
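The general pattern behind these fixes can be sketched as follows (a minimal illustration, not the exact Opacus implementation): per-parameter tensors are moved to a common device before being combined with `torch.stack`. The function name and `target_device` argument here are hypothetical.

```python
import torch

def total_per_sample_norm(per_param_norms, target_device="cuda:0"):
    # per_param_norms: list of per-sample norm tensors, one per parameter,
    # possibly living on different devices when the model is split across GPUs.
    # Moving them to one device before torch.stack avoids the
    # "expected all tensors to be on the same device" runtime error.
    aligned = [n.to(target_device) for n in per_param_norms]
    # Stack to shape [num_params, batch_size] and reduce over parameters
    # to get one norm per sample.
    return torch.stack(aligned, dim=0).norm(2, dim=0)
```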
## How Has This Been Tested
- The code was used to train a 7B Zetta model with LoRA on an 8xH200 GPU node.
- Added test suite in `multidevice_optimizer_test.py` covering:
- `DPOptimizer`, `AdaClipDPOptimizer`, and `DPPerLayerOptimizer` with multi-device models
- Both `clip_and_accumulate()` and full `step()` operations
- Helper function `_clip_and_accumulate_parameter()` with multi-device parameters
- Added additional tests in `grad_sample_module_fast_gradient_clipping_test.py` for:
- `get_norm_sample()` with parameters on different devices
- `get_clipping_coef()` with parameters on different devices
- All tests require at least 2 GPUs and verify that operations complete without device mismatch errors while maintaining correctness (see the sketch below).
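For orientation, the shape of such a test might look like the following sketch (class and method names are illustrative, not the actual contents of the test files):

```python
import unittest
import torch

@unittest.skipIf(torch.cuda.device_count() < 2, "requires at least 2 GPUs")
class MultiDeviceOptimizerSketchTest(unittest.TestCase):
    def test_step_completes_across_devices(self):
        # Hypothetical check: build a model with layers on cuda:0 and cuda:1,
        # wrap it with the DP optimizer under test, run one forward/backward
        # pass, and assert that clip_and_accumulate() and step() finish
        # without a device-mismatch error.
        self.skipTest("illustrative sketch only")
```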
## Checklist
- [x] The documentation is up-to-date with the changes I made.
- [x] I have read the **CONTRIBUTING** document and completed the CLA (see **CONTRIBUTING**).
- [x] All tests passed, and additional code has been covered with new tests.
Pull Request resolved: #796
Reviewed By: iden-kalemaj
Differential Revision: D85355821
fbshipit-source-id: 19da3c47ba5308748e839984194d1ce4b802d52f