One model per GPU #786
Unanswered
WaleedAlfaris asked this question in Q&A
Hello,
I have a system with 4 CUDA-enabled GPUs, each with 16 GB of VRAM. I have a single API that loads the models into a pool and uses a queue to process queries first in, first out. I can successfully run 4 llama2-7B models on this system. However, when I do this, the models are split across the 4 GPUs automatically. Is there any way to specify which models are loaded on which devices? I would like to load each model fully onto a single GPU, with model 1 fully on GPU 0, model 2 on GPU 1, and so on, without splitting a single model across multiple GPUs. Is this possible?
When looking online, I found the
export CUDA_VISIBLE_DEVICES=1
command, but since I am loading all the models in a single script, this would limit all of the models to the visible GPUs and would still allocate them automatically. Unless there is a way to use the command in another way?
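One commonly cited workaround (not from this thread, and assuming the models are loaded with llama-cpp-python) is to give each model its own process and set `CUDA_VISIBLE_DEVICES` in that process's environment before CUDA is initialized; inside each worker the chosen GPU then shows up as device 0, so the model loads entirely onto it. A rough sketch, with illustrative worker and queue names:

```python
# Sketch of the per-process workaround: set CUDA_VISIBLE_DEVICES per worker
# *before* the model library initializes CUDA, so each worker sees only one
# GPU and loads its model fully onto it. Assumes llama-cpp-python; the
# worker/queue structure and model paths are placeholders.
import multiprocessing as mp
import os

MODEL_PATHS = ["model0.gguf", "model1.gguf", "model2.gguf", "model3.gguf"]  # placeholder paths

def worker(gpu_id, model_path, requests, results):
    # Must happen before anything that touches CUDA runs in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from llama_cpp import Llama  # import inside the worker, after the env var is set

    llm = Llama(model_path=model_path, n_gpu_layers=-1)  # the chosen GPU is device 0 here
    while True:
        prompt = requests.get()  # FIFO: first in, first out
        if prompt is None:       # sentinel to shut the worker down
            break
        results.put(llm(prompt, max_tokens=256))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # "spawn" so children don't inherit a CUDA context
    requests, results = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, path, requests, results))
             for i, path in enumerate(MODEL_PATHS)]
    for p in procs:
        p.start()
    # Feed prompts with requests.put(...), read completions from results,
    # and put one None per worker to stop them.
```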
Replies: 1 comment

Wish I had multiple GPUs to test it out, but have you tried the main_gpu param?
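If main_gpu is available in your build, a minimal sketch of pinning each instance to one device with llama-cpp-python (assuming the Llama constructor exposes main_gpu and tensor_split, which varies by version; the one-hot tensor_split is my assumption for keeping all layers on a single card):

```python
# Sketch (untested, assumes llama-cpp-python exposes main_gpu/tensor_split):
# main_gpu selects the primary device, and a one-hot tensor_split keeps all
# offloaded layers on that device instead of spreading them across the GPUs.
from llama_cpp import Llama

def load_on_gpu(model_path: str, gpu_id: int, n_gpus: int = 4) -> Llama:
    split = [0.0] * n_gpus
    split[gpu_id] = 1.0  # put 100% of the layers on the chosen GPU
    return Llama(
        model_path=model_path,
        n_gpu_layers=-1,     # offload all layers
        main_gpu=gpu_id,     # primary device (scratch/small tensors)
        tensor_split=split,  # proportion of the model assigned to each GPU
    )

models = [load_on_gpu(f"model{i}.gguf", gpu_id=i) for i in range(4)]  # placeholder paths
```

With one-hot tensor_split values like this, each instance should stay on its own card, which would avoid the automatic splitting across all four GPUs.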