Environment
- Opacus Version: 1.5.4
- PyTorch Version: 2.4.1+cu121
- Hugging Face accelerate Version: 1.7.0
- Python Version: 3.9
- Hardware: 2 x NVIDIA GPUs (e.g., A100)
- Training Setup: Hugging Face accelerate for Distributed Data Parallel (DDP) training.
Description
I am attempting to train a custom U-ViT (a U-Net style Transformer) diffusion model with Opacus for differential privacy. The model's core `forward` method requires two tensor inputs: the noised data `x` and the `timesteps`. This multi-input structure appears to create a fundamental incompatibility with Opacus's forward and backward hooks, leading to a cycle of `IndexError` and `TypeError` that I could only resolve with a significant architectural refactoring: using a single input tensor. That conflicts with how I want to train the model, because it forces me to move the timestep embedding layer outside of the trained model just to give `forward` a single input.
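For reference, the model has roughly the following shape. The class and layer names below are illustrative stand-ins for my actual U-ViT, not the real code; the only property that matters here is the two-tensor `forward` signature.

```python
import torch
import torch.nn as nn


class UViT(nn.Module):
    """Illustrative stand-in: two tensor inputs, only one of which carries a gradient."""

    def __init__(self, dim=256, num_timesteps=1000):
        super().__init__()
        self.patch_embed = nn.Linear(dim, dim)
        self.time_embed = nn.Embedding(num_timesteps, dim)
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        self.out = nn.Linear(dim, dim)

    def forward(self, x, timesteps):
        # x: (batch, tokens, dim) float tensor -- gradients flow back through it.
        # timesteps: (batch,) integer tensor -- only used to index an embedding.
        h = self.patch_embed(x) + self.time_embed(timesteps).unsqueeze(1)
        for blk in self.blocks:
            h = h + blk(h)
        return self.out(h)
```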
The Debugging Journey & The "Catch-22"
The debugging process revealed a "Catch-22" situation where fixing one error would immediately cause the other, pointing to a conflict between Opacus's forward and backward pass mechanisms.
1. Initial Error: `TypeError`
- My initial model call was `nnet(x=xt, timesteps=t)`.
- This failed during the backward pass with `TypeError: forward() missing 1 required positional argument: 'timesteps'`.
- This indicates that Opacus's backward hook (which uses functorch) was not correctly replaying the `timesteps` keyword argument. A reduced reproduction of the call pattern is sketched below.
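The pattern, stripped down to its essentials, looked like this. This is a hedged reconstruction using the stand-in `UViT` above, wrapped directly in `GradSampleModule` (the wrapping that `PrivacyEngine` performs internally) instead of going through `PrivacyEngine` and `accelerate`:

```python
from opacus.grad_sample import GradSampleModule

nnet = GradSampleModule(UViT())           # stand-in for the prepared, DP-wrapped model

xt = torch.randn(8, 16, 256)              # noised data
t = torch.randint(0, 1000, (8,))          # integer timesteps

pred = nnet(x=xt, timesteps=t)            # keyword arguments
loss = pred.pow(2).mean()
loss.backward()                           # reported failure point:
# TypeError: forward() missing 1 required positional argument: 'timesteps'
```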
2. Fixing the `TypeError` leads to an `IndexError`
- Changing the model call to use only positional arguments, `nnet(xt, t)`, resolved the `TypeError`.
- However, this immediately caused a new error during the backward pass: `IndexError: list index out of range` inside `opacus.grad_sample.grad_sample_module._get_batch_size`.
- By adding debug prints (sketched after this list), I confirmed that `module.activations` was an empty list `[]` for the top-level module, meaning the forward hook failed to capture the inputs, whether they were passed positionally or via the keyword `x=...`.
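The debug check was essentially the following sketch, continuing from the snippet above but with positional arguments. `activations` is the per-module list that Opacus's forward hooks populate and that `_get_batch_size` later indexes into:

```python
# After a forward pass, inspect the activation lists that Opacus's
# forward hooks should have filled on each hooked submodule.
pred = nnet(xt, t)

for name, module in nnet.named_modules():
    acts = getattr(module, "activations", None)
    if acts is not None:
        print(f"{name or '<root>'}: {len(acts)} captured activation tuple(s)")

# In my runs the top-level module reported an empty list ([]), which is what
# later makes _get_batch_size raise IndexError during the backward pass.
```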
3. The Root Cause: Mismatched Gradients
- The breakthrough came from realizing the core issue, as described by a developer in another issue: Opacus expects the number of non-`None` gradients returned by the backward pass to match the number of tensor inputs captured in the forward pass.
- My model's `forward(x, timesteps)` method takes two tensors as input.
- However, the loss depends on the model's parameters only through the `x` tensor. The `timesteps` tensor is used in the computation, but no gradient flows back to it from the loss.
- Opacus therefore sees two inputs but only one gradient, causing the mismatch and the crash. A minimal illustration follows this list.
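Here is a minimal, Opacus-free illustration of that mismatch as I understand it, using the stand-in `UViT` from above. The mechanism shown (a full backward hook counting non-`None` input gradients) is my interpretation of the behaviour, not a quote of Opacus internals:

```python
# A full backward hook receives the gradients w.r.t. the module's inputs.
# The integer `timesteps` tensor cannot require grad, so its entry is None:
# two tensor inputs go in, but only one non-None gradient comes back.
model = UViT()

def count_input_grads(module, grad_input, grad_output):
    non_none = sum(g is not None for g in grad_input)
    print(f"tensor inputs: 2, non-None input gradients: {non_none}")

model.register_full_backward_hook(count_input_grads)

x = torch.randn(8, 16, 256, requires_grad=True)   # grad enabled for illustration
t = torch.randint(0, 1000, (8,))
model(x, t).pow(2).mean().backward()
# -> tensor inputs: 2, non-None input gradients: 1
```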
The Solution: Architectural Refactoring
The only way I found to solve this was to refactor the architecture so that the Opacus-wrapped module (`UViT`) has a simple, single-tensor input. Moving the timestep embedding layer out in front of the model's `forward` method resolves the errors, but it forces me to change the `accelerate` configuration and make several adjustments to model inference, which is certainly not an ideal solution.
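In sketch form, the refactoring looks like the following. The names are again illustrative; in my real model only the timestep-embedding path had to move out, but the sketch folds everything into a single pre-computed input for clarity:

```python
import torch
import torch.nn as nn
from opacus.grad_sample import GradSampleModule


class UViTCore(nn.Module):
    """The part Opacus wraps: a single float-tensor input."""

    def __init__(self, dim=256):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
        self.out = nn.Linear(dim, dim)

    def forward(self, h):                 # one tensor in, one tensor out
        for blk in self.blocks:
            h = h + blk(h)
        return self.out(h)


# Kept outside the wrapped module, and therefore outside DP training --
# exactly the compromise described above.
patch_embed = nn.Linear(256, 256)
time_embed = nn.Embedding(1000, 256)

core = GradSampleModule(UViTCore())       # in practice: PrivacyEngine + accelerate

xt = torch.randn(8, 16, 256)
t = torch.randint(0, 1000, (8,))

h = patch_embed(xt) + time_embed(t).unsqueeze(1)   # fold timesteps into the single input
pred = core(h)
pred.pow(2).mean().backward()             # hooks now see one input and one gradient
```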
Conclusion
This issue seems to highlight a fundamental limitation in how Opacus handles models with multiple tensor inputs where not all inputs receive a gradient. The required architectural refactoring is significant and non-obvious. It would be beneficial if Opacus could either handle this case more gracefully or provide clearer error messages to guide users toward this solution.
Thank you for your time and for maintaining this library.