Question: why does GPT2FusedLinearConv1D_Col in shardformer perform allreduce twice in the backward pass? #4961
lichenlu started this conversation in Community | General
Replies: 3 comments 3 replies
-
@lichenlu With tensor parallelism, in a column-parallel layer the input is left untouched in the forward pass and its gradient is all-reduced in the backward pass; conversely, in a row-parallel layer the output is all-reduced in the forward pass and nothing is done in the backward pass. In other words, layers like …
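A minimal sketch of the two communication patterns described above, written as plain `torch.autograd.Function` classes. The class names are illustrative only; this is not ColossalAI's actual implementation.

```python
import torch
import torch.distributed as dist


class _CopyToTensorParallelRegion(torch.autograd.Function):
    """Column-parallel input: identity in forward, allreduce of the gradient in backward."""

    @staticmethod
    def forward(ctx, input_):
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        # Each rank holds only a partial input gradient (the weight is column-sharded),
        # so the gradient must be summed across ranks here.
        dist.all_reduce(grad_output)
        return grad_output


class _ReduceFromTensorParallelRegion(torch.autograd.Function):
    """Row-parallel output: allreduce in forward, identity in backward."""

    @staticmethod
    def forward(ctx, input_):
        dist.all_reduce(input_)
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output
```

GPT2FusedLinearConv1D_Col is a column-parallel layer, so the first pattern (gradient allreduce in backward) is the one relevant to this thread.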
1 reply
-
@FrankLeeeee Could you explain this detail in a bit more depth?
0 replies
-
In GPT2FusedLinearConv1D_Col, ctx.async_grad_allreduce defaults to False, so the allreduce inside matmul_with_async_comm is not executed; in total only one allreduce is performed.
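A hedged sketch of the gating described above: the matmul's backward computes the input gradient but only all-reduces it when `ctx.async_grad_allreduce` is True. The class below paraphrases that idea and is not a copy of the ColossalAI source; a GPT-2 Conv1D-style weight layout `(in_features, out_features)` is assumed.

```python
import torch
import torch.distributed as dist


class _MatmulWithAsyncComm(torch.autograd.Function):
    """Toy version of a matmul whose backward allreduce is optional."""

    @staticmethod
    def forward(ctx, input_, weight, async_grad_allreduce):
        ctx.save_for_backward(input_, weight)
        ctx.async_grad_allreduce = async_grad_allreduce
        return input_.matmul(weight)

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight = ctx.saved_tensors
        grad_input = grad_output.matmul(weight.t())

        handle = None
        if ctx.async_grad_allreduce:
            # Only taken when async_grad_allreduce=True. With the default of False,
            # reducing grad_input is left entirely to reduce_backward, so the
            # backward pass performs a single allreduce overall.
            handle = dist.all_reduce(grad_input, async_op=True)

        # Flatten the leading (batch, seq) dims so the weight gradient is a 2D matmul.
        grad_output_2d = grad_output.reshape(-1, grad_output.shape[-1])
        input_2d = input_.reshape(-1, input_.shape[-1])
        grad_weight = input_2d.t().matmul(grad_output_2d)

        if handle is not None:
            handle.wait()
        return grad_input, grad_weight, None
```

In this sketch the optional allreduce is launched asynchronously so it can overlap with the weight-gradient matmul, which is consistent with the flag being named async_grad_allreduce.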
2 replies
-
As shown in the figure, the forward function of GPT2FusedLinearConv1D_Col uses two functions, reduce_backward and matmul_with_async_comm, and both of them perform an allreduce during backward. Isn't that redundant?
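For reference, a self-contained toy of the structure the question describes: the forward chains a reduce-in-backward step with a matmul step. If both all-reduced unconditionally in backward, the input gradient would indeed be reduced twice; as the reply above explains, the matmul's allreduce sits behind async_grad_allreduce, which defaults to False. All names below are stand-ins for illustration, not ColossalAI's implementation.

```python
import torch
import torch.distributed as dist


class _ReduceBackwardStandIn(torch.autograd.Function):
    """Stand-in for reduce_backward: identity in forward, allreduce of the gradient in backward."""

    @staticmethod
    def forward(ctx, input_):
        return input_

    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output)      # the allreduce that always runs
        return grad_output


class _GatedMatmulStandIn(torch.autograd.Function):
    """Stand-in for the matmul step: its backward allreduce only runs when the flag is True."""

    @staticmethod
    def forward(ctx, input_, weight, async_grad_allreduce):
        ctx.save_for_backward(weight)
        ctx.async_grad_allreduce = async_grad_allreduce
        return input_.matmul(weight)

    @staticmethod
    def backward(ctx, grad_output):
        (weight,) = ctx.saved_tensors
        grad_input = grad_output.matmul(weight.t())
        if ctx.async_grad_allreduce:
            dist.all_reduce(grad_input)   # skipped with the default of False
        return grad_input, None, None     # weight gradient omitted in this toy


def column_parallel_forward(input_, weight):
    # Chained as in the question; with the flag left False, backward allreduces only once.
    input_ = _ReduceBackwardStandIn.apply(input_)
    return _GatedMatmulStandIn.apply(input_, weight, False)
```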