Why use tensor parallelism when the model can easily fit on a single GPU? #294
vikigenius
announced in
Q&A
-
You are right: for small models, you should just use one GPU. You can start multiple vLLM replicas to achieve "data parallelism" for serving, which is why it is not shown in our code. Tensor parallelism is mainly for large models that cannot fit on a single GPU.
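To see why tensor parallelism exists at all, here is a minimal numpy sketch (not vLLM code) of the column-parallel scheme it uses: a single layer's weight matrix is sharded across devices, each device computes its slice, and the shards are gathered back. The device count and shapes are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: tensor parallelism shards one layer's weights so
# that each "GPU" holds only part of the matrix.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # a batch of activations
W = rng.standard_normal((8, 6))        # the full weight matrix

# Column-parallel split across 2 hypothetical devices:
W0, W1 = np.split(W, 2, axis=1)        # each device stores half the columns
y0 = x @ W0                            # partial output computed on device 0
y1 = x @ W1                            # partial output computed on device 1
y = np.concatenate([y0, y1], axis=1)   # "all-gather" the output shards

# The sharded computation matches the single-GPU result exactly; the
# price is the inter-device communication, which is why it only pays
# off when the model is too large for one GPU's memory.
assert np.allclose(y, x @ W)
```

The benefit is memory: each device stores only `1/N` of the layer. For a model that already fits on one GPU, the gather step adds communication cost without saving anything you needed to save.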
-
Model parallelism can indeed be beneficial in serving scenarios when serving multiple models at the same time (https://arxiv.org/abs/2302.11665), but when serving a single model that fits on one GPU, you should stick to one GPU.
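For contrast, the multi-replica "data parallelism" approach mentioned above can be sketched as a round-robin request router. This is a toy illustration with hypothetical replica names; a real deployment would put the replicas behind a proper load balancer.

```python
from itertools import cycle

# Hypothetical setup: three independent serving replicas, each a full
# copy of the model on its own GPU (e.g. three separate vLLM servers).
replicas = ["replica-0", "replica-1", "replica-2"]
router = cycle(replicas)

def route(request: str) -> str:
    # Round-robin dispatch: each incoming request goes to the next
    # replica in turn, so load spreads evenly across GPUs.
    return next(router)

assignments = [route(f"req-{i}") for i in range(6)]
```

Unlike tensor parallelism, the replicas never communicate per-token; throughput scales with replica count while per-request latency stays that of a single GPU.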
-
If the model can fit on a single GPU, wouldn't it be better to use something like DDP instead? What are the advantages of tensor parallelism if the model is small enough to fit on a single GPU?