Fine grained control of GPU offloading #7678

Answered by robcowart
amaxymillian asked this question in Q&A

You will need to use the --tensor-split parameter...

-ts,   --tensor-split N0,N1,N2,...      fraction of the model to offload to each GPU, comma-separated list of
                                        proportions, e.g. 3,1
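To make the proportions concrete: the list is normalized by its sum, so each GPU receives its value divided by the total. The proportions `3,1` used here are just the example from the help text; this sketch computes the resulting per-GPU share with `awk`:

```shell
# Normalize a --tensor-split proportion list into per-GPU percentages.
# With "3,1", GPU0 gets 3/(3+1) = 75% of the offloaded model, GPU1 gets 25%.
props="3,1"
echo "$props" | awk -F, '{
  total = 0
  for (i = 1; i <= NF; i++) total += $i       # sum of all proportions
  for (i = 1; i <= NF; i++)                   # each share = value / total
    printf "GPU%d: %.0f%%\n", i - 1, 100 * $i / total
}'
```

So `-ts 3,1` on a 24 GB + 8 GB pair of cards would place roughly three quarters of the offloaded tensors on the first GPU, which matches the 3:1 VRAM ratio.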

You may also need to use a few other parameters as well...

-c,    --ctx-size N                     size of the prompt context (default: 0, 0 = loaded from model)
                                        (env: LLAMA_ARG_CTX_SIZE)

-mg,   --main-gpu INDEX                 the GPU to use for the model (with split-mode = none), or for
                                        intermediate results and KV (with split-mode = row) (default: 0)
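Putting the flags together, a typical invocation might look like the sketch below. The model path and the specific values are placeholders, not taken from this discussion; the flags themselves (`-ngl`, `-ts`, `-c`, `-mg`) come from llama.cpp's help text:

```shell
# Sketch only: ./models/model.gguf and the flag values are illustrative.
# -ngl 99 -> offload all layers to GPU
# -ts 3,1 -> GPU0 receives 3/4 of the offloaded tensors, GPU1 receives 1/4
# -c 4096 -> prompt context of 4096 tokens
# -mg 0   -> GPU0 holds intermediate results and KV (with split-mode = row)
./llama-cli -m ./models/model.gguf -ngl 99 -ts 3,1 -c 4096 -mg 0
```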

Consider the following GPUs...

+-----…

Replies: 2 comments 2 replies

@Allan-Luu and @robcowart replied in the thread.
Answer selected by amaxymillian