Why use a manual tensor parallelism implementation instead of something like DeepSpeed? #267
vikigenius announced in Q&A
Replies: 1 comment 1 reply
-
I know it could have just been a design decision, but I would love to hear the rationale for rolling your own implementation of model parallelism rather than taking on something like DeepSpeed as a dependency.
-
ZeRO and FSDP communicate weights instead of activations. At inference time, activations are small but weights are large, so tensor parallelism is the more efficient choice.
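To make the trade-off concrete: for a hidden size of 4096, a single 4096×4096 weight matrix holds ~16M parameters (~32 MB in fp16), while the activation for one decoded token is just 4096 values (~8 KB), so shipping activations is orders of magnitude cheaper per step. Below is a minimal sketch of a row-parallel linear layer illustrating this communication pattern. It is not the repo's actual implementation; the class name `RowParallelLinear` is illustrative, and it assumes `torch.distributed` has already been initialized (e.g. via `torchrun`).

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class RowParallelLinear(nn.Module):
    """Each rank holds a slice of the weight along the input dimension.
    The forward pass multiplies the local shard, then all-reduces the
    partial outputs -- so only activations cross the network."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert in_features % world_size == 0
        # Each rank stores only in_features / world_size input columns;
        # the full weight never needs to be gathered.
        self.weight = nn.Parameter(
            torch.empty(out_features, in_features // world_size)
        )
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, x_shard: torch.Tensor) -> torch.Tensor:
        # x_shard: the slice of the input owned by this rank,
        # shape (batch, in_features // world_size).
        partial = x_shard @ self.weight.t()
        # Sum the partial results across ranks. This communicates only
        # the activations (batch x out_features), not the weight matrix,
        # which is what ZeRO/FSDP would have to gather instead.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```

Under this pattern each forward pass moves one activation tensor per parallel layer, whereas a ZeRO-3/FSDP-style scheme would all-gather every sharded weight on every pass, which is wasteful when the weights dwarf the activations, as they do at inference with small batches.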