[BUG] The same program runs fine with v0.17.5 but fails with v0.17.6 under the ZeRO-2 configuration #7629

@hemengfei2014-stack

Description

rank2:   File "/mnt/data/anaconda3/envs/optimizer/lib/python3.13/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
rank2:     ret_val = func(*args, **kwargs)
rank2:   File "/mnt/data/anaconda3/envs/optimizer/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 2324, in backward
rank2:     self._backward_epilogue()
rank2:     ~~~~~~~~~~~~~~~~~~~~~~~^^
rank2:   File "/mnt/data/anaconda3/envs/optimizer/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 2260, in _backward_epilogue
rank2:     self.allreduce_gradients()
rank2:     ~~~~~~~~~~~~~~~~~~~~~~~~^^
rank2:   File "/mnt/data/anaconda3/envs/optimizer/lib/python3.13/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
rank2:     ret_val = func(*args, **kwargs)
rank2:   File "/mnt/data/anaconda3/envs/optimizer/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 2211, in allreduce_gradients
rank2:     self.optimizer.overlapping_partition_gradients_reduce_epilogue()
rank2:     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
rank2:   File "/mnt/data/anaconda3/envs/optimizer/lib/python3.13/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 946, in overlapping_partition_gradients_reduce_epilogue
rank2:     self.independent_gradient_partition_epilogue()
rank2:     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
rank2:   File "/mnt/data/anaconda3/envs/optimizer/lib/python3.13/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 860, in independent_gradient_partition_epilogue
rank2:     for accumulated_grad, new_avg_grad in zip(self.all_grad_tensors[i], avg_new):
rank2:                                           ~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank2: TypeError: 'NoneType' object is not iterable
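
The last frame fails inside the `zip(...)` call, so at that point one of its two arguments (`self.all_grad_tensors[i]` or `avg_new`) was `None`. The snippet below is only a standalone sketch of that failure mode in plain Python, not DeepSpeed's actual code. `accumulate`, `accumulate_guarded`, and `raises_type_error` are hypothetical names made up for illustration, and it is an assumption that the accumulated side is the one that is `None`. Whether a guard like this is an appropriate fix depends on why the value is `None` in v0.17.6.

```python
def accumulate(accumulated, new):
    """Add each new value into the matching accumulated value.

    Mirrors the shape of the failing loop: zip() over two sequences.
    Raises TypeError if either argument is None.
    """
    return [a + n for a, n in zip(accumulated, new)]


def accumulate_guarded(accumulated, new):
    """Hypothetical guard: treat a missing accumulator as 'nothing accumulated yet'."""
    if accumulated is None:
        return list(new)
    return accumulate(accumulated, new)


def raises_type_error(fn, *args):
    """Return True if calling fn(*args) raises TypeError, else False."""
    try:
        fn(*args)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    # The unguarded version fails the same way as the traceback above.
    print(raises_type_error(accumulate, None, [1.0, 2.0]))
    # The guarded version falls back to the new values.
    print(accumulate_guarded(None, [1.0, 2.0]))
```

Printing or asserting on the two `zip` arguments just before line 860 of `stage_1_and_2.py` on the failing rank would show which of them is actually `None`.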

Labels: bug, training
