pipeline parallelism activation memory #7202
Unanswered
ghadialhajj asked this question in Q&A
Replies: 0 comments
On DeepSpeed's page here, it says that in pipeline parallelism, "the activation memory on the first stage of the pipeline is approximately the same as the total activation memory for a single micro-batch".
Doesn't the first stage (like any other stage) need to store the activations of every micro-batch it has processed, not just those of a single micro-batch, until the corresponding backward passes run? Or did I misinterpret this sentence?
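To make the question concrete, here is a rough back-of-the-envelope sketch of one reading that might reconcile the two views. It rests on assumptions that are not stated in the quoted sentence: that the schedule interleaves forward and backward passes (1F1B style), so stage `s` of `P` stages holds at most `P - s` micro-batches in flight, and that layers are split evenly, so each stage holds `1/P` of the full model's activations per micro-batch. The function names and numbers are hypothetical, chosen only for illustration.

```python
def peak_interleaved(stage, num_stages, num_micro_batches, full_act_per_mb):
    """Peak activation memory on one stage, ASSUMING a 1F1B-style schedule.

    Assumption: stage `stage` (0-indexed) keeps at most
    min(num_stages - stage, num_micro_batches) micro-batches in flight.
    """
    in_flight = min(num_stages - stage, num_micro_batches)
    # Assumption: layers are split evenly, so each stage stores 1/P of the
    # full model's activations for every micro-batch it holds.
    per_stage_per_mb = full_act_per_mb / num_stages
    return in_flight * per_stage_per_mb


def peak_all_forward_first(num_stages, num_micro_batches, full_act_per_mb):
    """Peak per-stage memory if every forward pass runs before any backward."""
    return num_micro_batches * full_act_per_mb / num_stages


num_stages = 4            # hypothetical pipeline depth
num_micro_batches = 16    # hypothetical number of micro-batches per step
full_act_per_mb = 8.0     # hypothetical: full-model activations for one micro-batch (GB)

peaks = [
    peak_interleaved(s, num_stages, num_micro_batches, full_act_per_mb)
    for s in range(num_stages)
]
naive = peak_all_forward_first(num_stages, num_micro_batches, full_act_per_mb)

print(peaks)  # per-stage peaks under the interleaved assumption
print(naive)  # per-stage peak if all micro-batches are stored before any backward
```

Under these assumptions the first stage holds `P` micro-batches at `1/P` of the model each, which comes out to the full-model activation memory for one micro-batch. The all-forward-first number is what I had in mind when asking, and it grows with the number of micro-batches. Is the interleaved reading what the sentence means?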