Replies: 1 comment
-
If you want to run vLLM with data parallelism, use nginx as a load balancer in front of multiple server instances (check 👉 the vLLM docs). BTW, if you are running vLLM as a single-node server, it is better to use tensor parallelism with respect to concurrency and latency (but only if your GPUs can communicate with each other quickly).
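Roughly, the setup looks like this (a sketch, not the exact example from the vLLM docs; the model name and ports are placeholders). Each replica is an OpenAI-compatible server pinned to one GPU, and an nginx upstream pointing at ports 8000-8003 would round-robin requests across them:

```python
# Sketch: launch one vLLM OpenAI-compatible server per GPU.
# nginx (or any HTTP load balancer) then distributes requests across the ports.
import os
import subprocess

MODEL = "facebook/opt-125m"  # placeholder; use your own model
BASE_PORT = 8000

procs = []
for gpu_id in range(4):
    # Pin each replica to a single GPU via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL, "--port", str(BASE_PORT + gpu_id)],
        env=env,
    ))

for p in procs:
    p.wait()
```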
-
In the docs, it is mentioned that:

If I'm not mistaken, using `tp=4`, the layers of the model will be distributed across 4 GPUs, so eventually this would be **model-parallel** inferencing. But since my model can fit on a single GPU, I would like to do data-parallel instead, where each GPU gets a replica of the model, letting me send 4 inputs to 4 GPUs. Is this possible with vLLM? I could probably use Python multiprocessing, but how do I control GPU assignment? (Something like the sketch below, perhaps?)