Replies: 2 comments · 1 reply
-
But with ggml, you should always offload all layers. There are four model sizes. It might also be difficult for developers to optimize if the NVIDIA drivers automatically offload to RAM nowadays. :(
1 reply
-
No.
-
I don't want to have to experiment to find the optimal number for -ngl, and would rather have llama.cpp choose it for me. Is there anything built in, or does anyone have a recommendation for how I could do this outside of llama.cpp?
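In the absence of a built-in option, here is a minimal sketch of one workaround: estimate how many layers fit in free VRAM and pass the result as -ngl. Everything in it is an assumption to adapt, not llama.cpp's own logic; the layer count must be supplied by hand (llama.cpp prints it when loading a model), the per-layer size is crudely approximated as the file size divided evenly across layers, the headroom reserved for the KV cache and compute buffers is a guess, and the llama-cli binary name and path may differ in your build.

```python
#!/usr/bin/env python3
"""Rough heuristic for picking -ngl before launching llama.cpp.

Assumptions: a single GGUF model file, nvidia-smi on PATH, and a
known layer count (printed by llama.cpp at load time).
"""
import os
import subprocess
import sys

def free_vram_mib() -> int:
    # Ask the driver for free memory on GPU 0, in MiB.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.splitlines()[0].strip())

def guess_ngl(model_path: str, n_layers: int, headroom_mib: int = 1500) -> int:
    # Crude estimate: treat all layers as equal in size and leave
    # headroom for the KV cache and compute buffers (a rough guess).
    model_mib = os.path.getsize(model_path) / (1024 * 1024)
    per_layer_mib = model_mib / n_layers
    budget = free_vram_mib() - headroom_mib
    if budget <= 0:
        return 0
    return min(n_layers, int(budget / per_layer_mib))

if __name__ == "__main__":
    model, layers = sys.argv[1], int(sys.argv[2])
    ngl = guess_ngl(model, layers)
    print(f"launching with -ngl {ngl}")
    # Binary name is a placeholder: older builds call it 'main',
    # newer ones 'llama-cli'. Adjust the path and flags to taste.
    subprocess.run(["./llama-cli", "-m", model, "-ngl", str(ngl), "-p", "Hello"])
```

As a side note, passing an -ngl value at least as large as the model's layer count offloads everything, which is what the "always offload all layers" advice above amounts to.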