Hello! I was recently informed that there's a limit on how much RAM a Mac's GPU can use relative to the total available RAM.

Rumor had it the limit is either 66% or 75%, so I ran some tests and found those figures to be accurate.

Is the llama.cpp team aware of this? If so, are there any official numbers on these limits, and is there a technical explanation behind them? Would this info change whether offloading to the CPU, or putting effort toward combining CPU & GPU, would be useful? (CC #3083)
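For reference, Metal reports the relevant caps directly on the device object: `recommendedMaxWorkingSetSize` (the GPU working-set budget, which reportedly lands around 66-75% of physical RAM depending on the machine) and `maxBufferLength` (the largest single buffer). A minimal sketch to print them on your own machine, assuming nothing beyond the standard Metal framework:

```objc
// Print the Metal memory caps this discussion is about.
// recommendedMaxWorkingSetSize: the GPU working-set budget
//   (reportedly ~66-75% of physical RAM, depending on the machine).
// maxBufferLength: the largest single MTLBuffer the device allows.
// Build with: clang -framework Foundation -framework Metal limits.m -o limits
#import <Metal/Metal.h>

int main(void) {
    @autoreleasepool {
        id<MTLDevice> device = MTLCreateSystemDefaultDevice();
        NSLog(@"device:                       %@", device.name);
        NSLog(@"recommendedMaxWorkingSetSize: %llu MiB",
              device.recommendedMaxWorkingSetSize / (1024 * 1024));
        NSLog(@"maxBufferLength:              %lu MiB",
              (unsigned long)(device.maxBufferLength / (1024 * 1024)));
    }
    return 0;
}
```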
Replies: 2 comments

-
Yes, this has been mentioned many times in the discussions - suggest you do some searching. There is a fix that tweaks the macOS kernel to change memory allocation. It requires a reboot and decreasing security protections.
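For context, the tweak usually referenced raises the iogpu wired-memory limit: on older macOS releases it is set as a `debug.iogpu.wired_limit` boot argument (which is what needs SIP relaxed and a reboot), while newer Apple Silicon releases reportedly expose it as the `iogpu.wired_limit_mb` sysctl. A minimal sketch that reads the current value via `sysctlbyname`, assuming the newer key name and a 64-bit integer value:

```c
// Read the GPU wired-memory cap that the kernel tweak adjusts.
// Assumes the key is `iogpu.wired_limit_mb` (recent Apple Silicon macOS);
// older releases reportedly used `debug.iogpu.wired_limit` instead.
// Build with: clang wired_limit.c -o wired_limit
#include <stdio.h>
#include <sys/sysctl.h>

int main(void) {
    long long limit_mb = 0;               // assuming a 64-bit integer sysctl
    size_t len = sizeof(limit_mb);
    if (sysctlbyname("iogpu.wired_limit_mb", &limit_mb, &len, NULL, 0) == 0) {
        printf("iogpu.wired_limit_mb = %lld MB (0 = system default cap)\n", limit_mb);
    } else {
        perror("sysctlbyname");
    }
    return 0;
}
```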
-
@Lyrcaxis If you apply this patch, does it work for you?

```diff
diff --git a/ggml-metal.m b/ggml-metal.m
index 1139ee31..9147a92f 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -441,7 +441,7 @@ bool ggml_metal_add_buffer(
         }
 
         // the buffer fits into the max buffer size allowed by the device
-        if (size_aligned <= ctx->device.maxBufferLength) {
+        if (size_aligned <= ctx->device.maxBufferLength/2) {
             ctx->buffers[ctx->n_buffers].name = name;
             ctx->buffers[ctx->n_buffers].data = data;
             ctx->buffers[ctx->n_buffers].size = size;
@@ -460,8 +460,8 @@ bool ggml_metal_add_buffer(
         // this overlap between the views will guarantee that the tensor with the maximum size will fully fit into
         // one of the views
         const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page; // round-up 2 pages just in case
-        const size_t size_step = ctx->device.maxBufferLength - size_ovlp;
-        const size_t size_view = ctx->device.maxBufferLength;
+        const size_t size_step = ctx->device.maxBufferLength/2 - size_ovlp;
+        const size_t size_view = ctx->device.maxBufferLength/2;
 
         for (size_t i = 0; i < size; i += size_step) {
             const size_t size_step_aligned = (i + size_view <= size) ? size_view : (size_aligned - i);
```

Make sure to use the …
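To see what the patch changes: when a buffer is bigger than the per-buffer cap, `ggml_metal_add_buffer` maps it as a series of views of `size_view` bytes starting every `size_step` bytes, so consecutive views overlap by `size_ovlp` (a little more than the largest tensor) and every tensor is guaranteed to sit wholly inside at least one view. The patch halves both the step and the view size. A standalone sketch of that arithmetic with made-up sizes (the real code takes them from the model buffer, the page size, and `ctx->device.maxBufferLength`):

```c
// Standalone sketch of the view-splitting arithmetic in ggml_metal_add_buffer.
// All sizes below are made up for illustration.
#include <stdio.h>
#include <stddef.h>

int main(void) {
    const size_t size_page = 16384;                        // page size on Apple Silicon
    const size_t max_buf   = 16ULL * 1024 * 1024 * 1024;   // pretend maxBufferLength = 16 GiB
    const size_t max_size  = 500ULL * 1024 * 1024;         // largest single tensor: 500 MiB
    const size_t size      = 40ULL * 1024 * 1024 * 1024;   // whole model buffer: 40 GiB

    // overlap: the largest tensor rounded up two pages, exactly as in the source
    const size_t size_ovlp = ((max_size + size_page - 1) / size_page + 1) * size_page;
    const size_t size_step = max_buf/2 - size_ovlp;        // patched: half the device cap
    const size_t size_view = max_buf/2;

    for (size_t i = 0; i < size; i += size_step) {
        // simplified: the real code clamps the tail view with the page-aligned size
        const size_t len = (i + size_view <= size) ? size_view : (size - i);
        printf("view at offset %zu, length %zu\n", i, len);
    }
    return 0;
}
```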