Hello! I was recently informed that there's a limit on how much RAM a Mac's GPU can use relative to the total available RAM.

Rumor had it the limit is either 66% or 75%, so I ran some tests and found those figures to be accurate.

Is the llama.cpp team aware of this? If so, are there any official numbers on these limits, and is there a technical explanation behind them? Would this info change whether offloading to the CPU, or putting effort toward combining CPU & GPU, would be useful? (CC #3083)
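For reference, Metal reports the relevant caps directly on the device object: `recommendedMaxWorkingSetSize` (the GPU working-set budget, which reportedly lands around 66-75% of physical RAM depending on the machine) and `maxBufferLength` (the largest single buffer). A minimal sketch to print them on your own machine, assuming nothing beyond the standard Metal framework:

```objc
// Print the Metal memory caps this discussion is about.
// recommendedMaxWorkingSetSize: the GPU working-set budget
//   (reportedly ~66-75% of physical RAM, depending on the machine).
// maxBufferLength: the largest single MTLBuffer the device allows.
// Build with: clang -framework Foundation -framework Metal limits.m -o limits
#import <Metal/Metal.h>

int main(void) {
    @autoreleasepool {
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        NSLog(@"device:                       %@", device.name);
        NSLog(@"recommendedMaxWorkingSetSize: %llu MiB",
              device.recommendedMaxWorkingSetSize / (1024 * 1024));
        NSLog(@"maxBufferLength:              %lu MiB",
              (unsigned long)(device.maxBufferLength / (1024 * 1024)));
    }
    return 0;
}
```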
Replies: 2 comments

-
Yes, this has been mentioned many times in the discussions - suggest you do some searching. There is a fix that tweaks the macOS kernel to change memory allocation. It requires a reboot and decreasing security protections.
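For context, the tweak usually referenced raises the iogpu wired-memory limit: on older macOS releases it is set as a `debug.iogpu.wired_limit` boot argument (which is what needs SIP relaxed and a reboot), while newer Apple Silicon releases reportedly expose it as the `iogpu.wired_limit_mb` sysctl. A minimal sketch that reads the current value via `sysctlbyname`, assuming the newer key name and a 64-bit integer value:

```c
// Read the GPU wired-memory cap that the kernel tweak adjusts.
// Assumes the key is `iogpu.wired_limit_mb` (recent Apple Silicon macOS);
// older releases reportedly used `debug.iogpu.wired_limit` instead.
// Build with: clang wired_limit.c -o wired_limit
#include <stdio.h>
#include <sys/sysctl.h>

int main(void) {
    long long limit_mb = 0;               // assuming a 64-bit integer sysctl
    size_t len = sizeof(limit_mb);
    if (sysctlbyname("iogpu.wired_limit_mb", &limit_mb, &len, NULL, 0) == 0) {
        printf("iogpu.wired_limit_mb = %lld MB (0 = system default cap)\n", limit_mb);
    } else {
        perror("sysctlbyname");
    }
    return 0;
}
```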
-
@Lyrcaxis If you apply this patch, does it work for you?

```diff
diff --git a/ggml-metal.m b/ggml-metal.m
index 1139ee31..9147a92f 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -441,7 +441,7 @@ bool ggml_metal_add_buffer(
         }
 
         // the buffer fits into the max buffer size allowed by the device
-        if (size_aligned <= ctx->device.maxBufferLength) {
+        if (size_aligned <= ctx->device.maxBufferLength/2) {
             ctx->buffers[ctx->n_buffers].name = name;
             ctx->buffers[ctx->n_buffers].data = data;
             ctx->buffers[ctx->n_buffers].size = size;
@@ -460,8 +460,8 @@ bool ggml_metal_add_buffer(
         // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
         // one of the views
         const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-        const size_t size_step = ctx->device.maxBufferLength - size_ovlp;
-        const size_t size_view = ctx->device.maxBufferLength;
+        const size_t size_step = ctx->device.maxBufferLength/2 - size_ovlp;
+        const size_t size_view = ctx->device.maxBufferLength/2;
 
         for (size_t i = 0; i < size; i += size_step) {
             const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);
```

Make sure to use the …
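To see what the patch changes: when a buffer is bigger than the per-buffer cap, `ggml_metal_add_buffer` maps it as a series of views of `size_view` bytes starting every `size_step` bytes, so consecutive views overlap by `size_ovlp` (a little more than the largest tensor) and every tensor is guaranteed to sit wholly inside at least one view. The patch halves both the step and the view size. A standalone sketch of that arithmetic with made-up sizes (the real code takes them from the model buffer, the page size, and `ctx->device.maxBufferLength`):

```c
// Standalone sketch of the view-splitting arithmetic in ggml_metal_add_buffer.
// All sizes below are made up for illustration.
#include <stdio.h>
#include <stddef.h>

int main(void) {
    const size_t size_page = 16384;                        // page size on Apple Silicon
    const size_t max_buf   = 16ULL * 1024 * 1024 * 1024;   // pretend maxBufferLength = 16 GiB
    const size_t max_size  = 500ULL * 1024 * 1024;         // largest single tensor: 500 MiB
    const size_t size      = 40ULL * 1024 * 1024 * 1024;   // whole model buffer: 40 GiB

    // overlap: the largest tensor rounded up two pages, exactly as in the source
    const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page;
    const size_t size_step = max_buf/2 - size_ovlp;        // patched: half the device cap
    const size_t size_view = max_buf/2;

    for (size_t i = 0; i < size; i += size_step) {
        // simplified: the real code clamps the tail view with the page-aligned size
        const size_t len = (i + size_view <= size) ? size_view : (size - i);
        printf("view at offset %zu, length %zu\n", i, len);
    }
    return 0;
}
```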