Replies: 3 comments 1 reply
-
Bump, I don't know why this isn't a thing yet.
-
For automatic GPU layer selection to work, I think the following variables have to be considered:
- Context size (the amount of memory needed for the KV cache at the full specified context)
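As a rough illustration of that KV-cache term, here is a minimal sketch assuming fp16 K/V entries and the usual two-tensors-per-layer layout; the helper name and parameters are illustrative, not llama.cpp internals:

```cpp
// Minimal sketch of a KV-cache size estimate (illustrative helper, not an
// existing llama.cpp function). Assumes one K and one V tensor per layer,
// each n_ctx * n_embd_kv elements, stored as fp16 (2 bytes per element).
#include <cstddef>

size_t kv_cache_bytes(size_t n_layers, size_t n_ctx, size_t n_embd_kv,
                      size_t bytes_per_element = 2 /* fp16 */) {
    return 2 /* K + V */ * n_layers * n_ctx * n_embd_kv * bytes_per_element;
}

// Example: 32 layers, 4096-dim KV, 4096 context, fp16
//   2 * 32 * 4096 * 4096 * 2 bytes = 2 GiB for the cache alone.
```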
-
Idk if I'm helping or not, but Ollama has implemented this feature. Perhaps you can look at their repository and see what they hooked into?
-
As the title suggests, it would be nice to have the GPU layer-offload count automatically adjusted depending on factors such as available VRAM.
I have created a "working" prototype that uses CUDA and a single GPU to calculate how many layers fit inside the GPU. Once the VRAM threshold is reached, offloading stops and the remaining layers are kept in RAM. This feature would make it easier to try, test, and use (new) models, as it eliminates the trial and error needed to find the maximum number of layers that fit a specific case. This is especially useful considering that the number of layers and their size vary for each model.
A possible solution would be to calculate the VRAM needed for each layer and look at the available VRAM, then offload the maximum number of layers possible without exceeding the available VRAM limit. The remainder goes into RAM.
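As a rough illustration of that selection step, here is a minimal sketch assuming a uniform per-layer cost and a known free-VRAM figure; the function and its parameters are hypothetical, not existing llama.cpp internals:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: pick the number of layers to offload so the total
// stays under the free-VRAM budget. per_layer_bytes would come from the
// model's tensor sizes; overhead_bytes reserves room for the KV cache,
// scratch buffers, and the CUDA context.
int auto_gpu_layer_count(size_t free_vram_bytes, size_t per_layer_bytes,
                         size_t overhead_bytes, int n_layers_total) {
    if (per_layer_bytes == 0 || free_vram_bytes <= overhead_bytes)
        return 0;                          // nothing fits: keep all layers in RAM
    size_t budget = free_vram_bytes - overhead_bytes;
    int fit = static_cast<int>(budget / per_layer_bytes);
    return std::min(fit, n_layers_total);  // never offload more layers than exist
}
```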
An example of getting available VRAM (this alone could help by simply exiting when more VRAM is planned to be allocated than is available); the relevant files are ggml-cuda.cu and llama.cpp:
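For reference, the CUDA runtime exposes this query through cudaMemGetInfo; below is a minimal standalone sketch, with the exact hook point inside ggml-cuda.cu left as an assumption:

```cpp
// Standalone sketch: query free/total VRAM on the current CUDA device.
// cudaMemGetInfo is a real CUDA runtime call; how it would be wired into
// ggml-cuda.cu for the offload decision is only assumed here.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("free VRAM: %zu MiB, total VRAM: %zu MiB\n",
                free_bytes >> 20, total_bytes >> 20);
    return 0;
}
```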