Replies: 3 comments 1 reply
-
Bump, I don't know why this isn't a thing yet.
-
For automatic GPU layer selection to work, I think the following variables have to be considered:
- Context size (the amount of memory needed for the KV cache at the full specified context)
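As a rough illustration of that KV-cache term, here is a minimal sketch assuming fp16 K/V entries and the usual two-tensors-per-layer layout; the helper name and parameters are illustrative, not llama.cpp internals:

```cpp
// Minimal sketch of a KV-cache size estimate (illustrative helper, not an
// existing llama.cpp function). Assumes one K and one V tensor per layer,
// each n_ctx * n_embd_kv elements, stored as fp16 (2 bytes per element).
#include <cstddef>

size_t kv_cache_bytes(size_t n_layers, size_t n_ctx, size_t n_embd_kv,
                      size_t bytes_per_element = 2 /* fp16 */) {
    return 2 /* K + V */ * n_layers * n_ctx * n_embd_kv * bytes_per_element;
}

// Example: 32 layers, 4096-dim KV, 4096 context, fp16
//   2 * 32 * 4096 * 4096 * 2 bytes = 2 GiB for the cache alone.
```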
-
Idk if I'm helping or not, but Ollama has implemented this feature. Perhaps you can look at their repository and see what they hooked into?
-
As the title suggests, it would be nice to have the GPU layer-offload count automatically adjusted depending on factors such as available VRAM.
I have created a "working" prototype that uses CUDA and a single GPU to calculate how many layers fit inside the GPU. Once the VRAM threshold is reached, offloading stops and the remaining layers are kept in RAM. This feature would make it easier to try, test, and use (new) models, as it eliminates the trial and error needed to find the maximum number of layers that fit a specific case. This is especially useful considering that the number of layers and their size vary for each model.
A possible solution would be to calculate the VRAM needed for each layer and look at the available VRAM, then offload the maximum number of layers possible without exceeding the available VRAM limit. The remainder goes into RAM.
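As a rough illustration of that selection step, here is a minimal sketch assuming a uniform per-layer cost and a known free-VRAM figure; the function and its parameters are hypothetical, not existing llama.cpp internals:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: pick the number of layers to offload so the total
// stays under the free-VRAM budget. per_layer_bytes would come from the
// model's tensor sizes; overhead_bytes reserves room for the KV cache,
// scratch buffers, and the CUDA context.
int auto_gpu_layer_count(size_t free_vram_bytes, size_t per_layer_bytes,
                         size_t overhead_bytes, int n_layers_total) {
    if (per_layer_bytes == 0 || free_vram_bytes <= overhead_bytes)
        return 0;                          // nothing fits: keep all layers in RAM
    size_t budget = free_vram_bytes - overhead_bytes;
    int fit = static_cast<int>(budget / per_layer_bytes);
    return std::min(fit, n_layers_total);  // never offload more layers than exist
}
```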
An example of getting available VRAM (this alone could help by simply exiting when more VRAM is planned to be allocated than is available); the relevant files are ggml-cuda.cu and llama.cpp:
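For reference, the CUDA runtime exposes this query through cudaMemGetInfo; below is a minimal standalone sketch, with the exact hook point inside ggml-cuda.cu left as an assumption:

```cpp
// Standalone sketch: query free/total VRAM on the current CUDA device.
// cudaMemGetInfo is a real CUDA runtime call; how it would be wired into
// ggml-cuda.cu for the offload decision is only assumed here.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("free VRAM: %zu MiB, total VRAM: %zu MiB\n",
                free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```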