How to use pipeline parallelism to serve a BLOOM model? #3013
Unanswered
gaoxt1983
asked this question in Community | Q&A
Replies: 2 comments 3 replies
-
Hi @gaoxt1983, in fact, as the BLOOM example demonstrates, we recommend using TP (tensor parallelism), because PP (pipeline parallelism) is inefficient for generation tasks due to the pipeline bubble.
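The bubble argument can be made concrete with the standard GPipe-style idle-time estimate. This is a sketch, not EnergonAI code; `bubble_fraction` is a hypothetical helper written here for illustration:

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle-time fraction of a GPipe-style pipeline schedule.

    With p stages and m microbatches, each stage is busy for m slots
    out of m + p - 1 total slots, so p - 1 slots are idle ("bubble"),
    giving an idle fraction of (p - 1) / (m + p - 1).
    """
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# Autoregressive decoding produces one token at a time, so each decode
# step behaves like a single microbatch (m = 1): most stages sit idle.
print(bubble_fraction(num_stages=4, num_microbatches=1))   # 0.75
print(bubble_fraction(num_stages=4, num_microbatches=32))  # ~0.086
```

With 4 pipeline stages and one microbatch, 75% of the pipeline is idle at every decode step, which is why TP is preferred for generation even though PP amortizes well for large-batch training.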
3 replies
-
Is it normal that generating one token takes 100 ms–120 ms on a node with 8 A100s?
0 replies
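For a rough sanity check on that latency: single-token decoding is typically memory-bandwidth bound, so reading all the weights once per token sets a lower bound. The figures below (fp16 weights, ~2 TB/s HBM bandwidth per A100 80GB) are assumptions for the estimate, not measurements from this thread:

```python
# Rough lower bound on per-token decode latency for a 175B-parameter
# model, assuming decoding is bound by reading the weights from HBM.
PARAMS = 175e9
BYTES_PER_PARAM = 2        # fp16 (assumed)
NUM_GPUS = 8
HBM_BW = 2.0e12            # ~2 TB/s per A100 80GB (assumed)

weights_bytes = PARAMS * BYTES_PER_PARAM        # 350 GB of weights
latency_s = weights_bytes / (NUM_GPUS * HBM_BW)
print(f"{latency_s * 1e3:.1f} ms")              # ~21.9 ms ideal floor
```

By this estimate the ideal floor is around 22 ms/token, so 100–120 ms is a few times the theoretical minimum, which is plausible once communication, kernel launch, and scheduling overheads are included.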
-
I have a BLOOM-175B pretrained model. I want to serve this model with EnergonAI on a single-node machine with 4 A100 GPUs, so I modified example/bloom/run.sh:
What I observed afterwards was that the 4 GPUs were mostly idle; the processes I monitored looked like this:
So what have I done wrong, and what should I do to achieve pipeline parallelism?