Replies: 2 comments
-
Hi @HermitSun, thanks for trying out vLLM and good question. When using multiple GPUs, vLLM creates 1 worker process per GPU. Thus, if you use 2 GPUs, there will be 3 processes in total and the process running your code will not directly use any GPU. To actually get the number, you will need to insert the measurement code inside the worker processes. BTW, you can configure vLLM's GPU memory usage via the `gpu_memory_utilization` argument.
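For the configuration part, a minimal sketch (the model name and numbers are just placeholders): the fraction of each GPU's memory that vLLM pre-allocates is controlled by the `gpu_memory_utilization` argument of the `LLM` constructor.

```python
from vllm import LLM

# Let vLLM use at most 80% of each GPU's memory, sharded over 2 GPUs.
# Model name and values are placeholders; adjust to your setup.
llm = LLM(
    model="facebook/opt-13b",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)
outputs = llm.generate(["Hello, my name is"])
```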
-
Thank you for your kind reply. After inserting code inside the worker processes I can get the numbers, though it is a bit inconvenient. Maybe we can provide some profiling hooks or decorators, if possible.
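In case a sketch helps: such a hook could be a simple decorator around worker-side methods that logs `torch.cuda.memory_allocated()` before and after the call. This is not part of vLLM, and it only reports memory for the process and device it runs in, so it would have to be applied inside each worker process.

```python
import functools
import torch

def log_gpu_memory(fn):
    """Log allocated CUDA memory before and after the wrapped call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        before = torch.cuda.memory_allocated()
        result = fn(*args, **kwargs)
        after = torch.cuda.memory_allocated()
        print(f"[{fn.__qualname__}] allocated: "
              f"{before / 1e9:.2f} GB -> {after / 1e9:.2f} GB")
        return result
    return wrapper
```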
-
I want to observe the GPU memory footprint of models when performing inference. When I run inference on a single GPU, `torch.cuda.memory_allocated` returns a positive number as expected, but when I run distributed inference, `torch.cuda.memory_allocated` returns 0. Should I use `nvidia-smi` or some other technique to get the GPU memory footprint? Any help would be appreciated.
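For what it's worth, a process-agnostic way to read the same numbers `nvidia-smi` shows is to query NVML directly from the driver process; a minimal sketch, assuming the `pynvml` package is installed:

```python
import pynvml

pynvml.nvmlInit()
# Report used/total memory per GPU, regardless of which process allocated it
# (this is the same figure nvidia-smi displays).
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 1e9:.2f} GB used / {mem.total / 1e9:.2f} GB total")
pynvml.nvmlShutdown()
```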