Replies: 1 comment
-
If you want to run vLLM with data parallelism, use nginx as a load balancer in front of multiple server instances (check 👉 the vLLM docs). BTW, if you are running vLLM as a single-node server, it is better to use tensor parallelism with respect to concurrency and latency (but only if your GPUs can communicate with each other quickly).
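Roughly, the setup looks like this (a sketch, not the exact example from the vLLM docs; the model name and ports are placeholders). Each replica is an OpenAI-compatible server pinned to one GPU, and an nginx upstream pointing at ports 8000-8003 would round-robin requests across them:

```python
# Sketch: launch one vLLM OpenAI-compatible server per GPU.
# nginx (or any HTTP load balancer) then distributes requests across the ports.
import os
import subprocess

MODEL = "facebook/opt-125m"  # placeholder; use your own model
BASE_PORT = 8000

procs = []
for gpu_id in range(4):
    # Pin each replica to a single GPU via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(
        ["python", "-m", "vllm.entrypoints.openai.api_server",
         "--model", MODEL, "--port", str(BASE_PORT + gpu_id)],
        env=env,
    ))

for p in procs:
    p.wait()
```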
-
In the docs, it is mentioned that:

If I'm not mistaken, using `tp=4`, the layers of the model will be distributed across 4 GPUs, so eventually this would be **model-parallel** inferencing. But since my model can fit on a single GPU, I would like to do data-parallel instead, where each GPU gets a replica of the model, letting me send 4 inputs to 4 GPUs. Is this possible with vLLM? I could probably use Python multiprocessing, but how do I control GPU assignment? (Something like the sketch below, perhaps?)