pipeline parallelism activation memory #7202

ghadialhajj · 2025-04-03T19:02:47Z

DeepSpeed's page here says that, in pipeline parallelism, "the activation memory on the first stage of the pipeline is approximately the same as the total activation memory for a single micro-batch."

Doesn't the first stage (like any other stage) need to store all the activations, not just those of a single micro-batch, for the later backward passes? Or am I misinterpreting this sentence?
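One possible reading of the quoted sentence, sketched as a small counting exercise rather than a statement about DeepSpeed's actual implementation: assume a 1F1B (one-forward-one-backward) schedule, layers split evenly across stages, equal activation size per layer, and no activation checkpointing. The function names and the `full_model_act_per_mb` parameter are hypothetical, introduced only for this illustration. Under those assumptions, each stage holds only 1/P of a micro-batch's total activations, and the first stage keeps at most P micro-batches in flight, so its peak is about P × (1/P) = one micro-batch's worth of total activation memory.

```python
def one_f_one_b_ops(stage, num_stages, num_micro_batches):
    """Order of forward ('F') and backward ('B') ops on one stage under 1F1B.

    Assumption: stage s runs (num_stages - s - 1) warm-up forwards, then
    alternates one forward with one backward, then drains the remaining backwards.
    """
    warmup = min(num_micro_batches, num_stages - stage - 1)
    steady = num_micro_batches - warmup
    ops = ["F"] * warmup
    for _ in range(steady):
        ops += ["F", "B"]
    ops += ["B"] * warmup
    return ops


def peak_in_flight(ops):
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op == "F" else -1  # F stores activations, B frees them
        peak = max(peak, live)
    return peak


def peak_activation_memory(num_stages, num_micro_batches, full_model_act_per_mb=1.0):
    """Peak activation memory per stage, in units of one micro-batch's
    activations for the whole model (hypothetical unit, even layer split)."""
    per_stage_act = full_model_act_per_mb / num_stages
    return [
        peak_in_flight(one_f_one_b_ops(s, num_stages, num_micro_batches)) * per_stage_act
        for s in range(num_stages)
    ]
```

With 4 stages and 8 micro-batches, `peak_activation_memory(4, 8)` gives `[1.0, 0.75, 0.5, 0.25]`: the first stage peaks at roughly one full micro-batch's total activations, which would match the quoted sentence. If every forward pass ran before any backward pass instead, the first stage would hold all 8 micro-batches at once (8 × 1/4 = 2.0 in the same units), which is the situation the question describes.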

Replies: 0 comments
