GAN model with ZeRO3 with offload #3088
Unanswered
EvgenyUgolkov asked this question in Q&A
Replies: 0 comments
Dear team, good day.
I am trying to run the GAN example you provide (MNIST dataset), but with ZeRO-3 and the offloading feature enabled in the configuration file, as follows:
I pass it as a dictionary in the code and initialize the models as follows:
The rest is exactly the same as in the GAN example you provide.
I am running it on 2 GPUs.
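The configuration and initialization snippets are not reproduced in this text. For illustration only, a ZeRO stage 3 configuration with CPU offload, written as a Python dictionary, might look roughly like the sketch below. The helper name `build_zero3_offload_config` and every value in it (batch size, learning rate, betas, `pin_memory`) are assumptions, not the settings used in this run.

```python
# Illustrative sketch only: NOT the configuration used in the run described here.
# The helper name and all values are assumptions.

def build_zero3_offload_config(train_batch_size=64, lr=2e-4):
    """Return a DeepSpeed-style config dict with ZeRO stage 3 and CPU offload."""
    return {
        "train_batch_size": train_batch_size,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": lr, "betas": [0.5, 0.999]},
        },
        "zero_optimization": {
            "stage": 3,
            # Keep partitioned parameters in CPU memory between uses.
            "offload_param": {"device": "cpu", "pin_memory": True},
            # Keep optimizer state in CPU memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
    }


ds_config = build_zero3_offload_config()

# With a dictionary config, each network would typically get its own engine
# (requires deepspeed and the model definitions, so it is not executed here):
# netD_engine, optimizerD, _, _ = deepspeed.initialize(
#     model=netD, model_parameters=netD.parameters(), config=ds_config)
# netG_engine, optimizerG, _, _ = deepspeed.initialize(
#     model=netG, model_parameters=netG.parameters(), config=ds_config)
```

In a setup like this, the discriminator and the generator are wrapped as two separate DeepSpeed engines.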
After one successful iteration, I get the following error:
[0/1][0/938] Loss_D: 1.4656 Loss_G: 4.7239 D(x): 0.6025 D(G(z)): 0.5315 / 0.0121
[0/1][0/938] Loss_D: 1.4656 Loss_G: 4.7239 D(x): 0.6025 D(G(z)): 0.5315 / 0.0121
Traceback (most recent call last):
  File "/ibex/user/ugolkoea/SUPER/GAN/gan/gan_deepspeed_train.py", line 208, in
    main()
  File "/ibex/user/ugolkoea/SUPER/GAN/gan/gan_deepspeed_train.py", line 205, in main
    train(args)
  File "/ibex/user/ugolkoea/SUPER/GAN/gan/gan_deepspeed_train.py", line 151, in train
    output = netD(real)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 348, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 475, in pre_sub_module_forward_function
    param_coordinator.trace_prologue(sub_module)
  File "/ibex/user/ugolkoea/env/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 147, in trace_prologue
    if sub_module != self.__submodule_order[self.__step_id]:
IndexError: tuple index out of range
The full output file is attached for your convenience:
slurm-24400324.out.txt
Could you tell me what I am doing wrong?
Regards, Evgeny