Replies: 2 comments
-
Hi @HermitSun, thanks for trying out vLLM and good question. When using multiple GPUs, vLLM creates 1 worker process per GPU. Thus, if you use 2 GPUs, there will be 3 processes in total and the process running your code will not directly use any GPU. To actually get the number, you will need to insert the measurement code inside the worker processes. BTW, you can configure vLLM's GPU memory usage via the `gpu_memory_utilization` argument.
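For the configuration part, a minimal sketch (the model name and numbers are just placeholders): the fraction of each GPU's memory that vLLM pre-allocates is controlled by the `gpu_memory_utilization` argument of the `LLM` constructor.

```python
from vllm import LLM

# Let vLLM use at most 80% of each GPU's memory, sharded over 2 GPUs.
# Model name and values are placeholders; adjust to your setup.
llm = LLM(
    model="facebook/opt-13b",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)
outputs = llm.generate(["Hello, my name is"])
```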
-
Thank you for your kind reply. After inserting code inside the worker processes I can get the numbers, though it is a bit inconvenient. Maybe we can provide some profiling hooks or decorators, if possible.
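In case a sketch helps: such a hook could be a simple decorator around worker-side methods that logs `torch.cuda.memory_allocated()` before and after the call. This is not part of vLLM, and it only reports memory for the process and device it runs in, so it would have to be applied inside each worker process.

```python
import functools
import torch

def log_gpu_memory(fn):
    """Log allocated CUDA memory before and after the wrapped call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        before = torch.cuda.memory_allocated()
        result = fn(*args, **kwargs)
        after = torch.cuda.memory_allocated()
        print(f"[{fn.__qualname__}] allocated: "
              f"{before / 1e9:.2f} GB -> {after / 1e9:.2f} GB")
        return result
    return wrapper
```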
-
I want to observe the GPU memory footprint of models when performing inference. When I run inference on a single GPU, `torch.cuda.memory_allocated` returns a positive number as expected, but when I run distributed inference, `torch.cuda.memory_allocated` returns 0. Should I use `nvidia-smi` or some other technique to get the GPU memory footprint? Any help would be appreciated.
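For what it's worth, a process-agnostic way to read the same numbers `nvidia-smi` shows is to query NVML directly from the driver process; a minimal sketch, assuming the `pynvml` package is installed:

```python
import pynvml

pynvml.nvmlInit()
# Report used/total memory per GPU, regardless of which process allocated it
# (this is the same figure nvidia-smi displays).
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 1e9:.2f} GB used / {mem.total / 1e9:.2f} GB total")
pynvml.nvmlShutdown()
```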