Replies: 1 comment
-
I think encoding the command buffers in parallel was recommended in some documentation (#1860), but I don't keep a reference.
-
In ggml-metal.m, there is code that submits several Metal work chunks in parallel:
const int n_cb = ctx->n_cb;
..
for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) { // create multiple command buffers
..
dispatch_apply(n_cb, ctx->d_queue, ^(size_t iter) { // run multiple command buffers
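For reference, here is a minimal, self-contained sketch of the same enqueue-then-encode-in-parallel pattern (not the actual ggml-metal.m code; n_nodes, the chunk split, and the empty encoder body are placeholders):

#import <Metal/Metal.h>
#import <dispatch/dispatch.h>

int main(void) {
    @autoreleasepool {
        id<MTLDevice>       device = MTLCreateSystemDefaultDevice();
        id<MTLCommandQueue> queue  = [device newCommandQueue];

        const int n_cb    = 8;     // number of command buffers (work chunks)
        const int n_nodes = 1024;  // placeholder for the total work to encode

        dispatch_queue_t d_queue =
            dispatch_queue_create("example.encode", DISPATCH_QUEUE_CONCURRENT);

        // Create and enqueue the command buffers up front, so the GPU executes
        // them in chunk order no matter which thread finishes encoding first.
        NSMutableArray<id<MTLCommandBuffer>> *cbs = [NSMutableArray arrayWithCapacity:n_cb];
        for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) {
            id<MTLCommandBuffer> cb = [queue commandBuffer];
            [cb enqueue];
            [cbs addObject:cb];
        }

        // Encode the chunks in parallel: each worker encodes its own slice into
        // its own command buffer and commits it independently.
        dispatch_apply(n_cb, d_queue, ^(size_t iter) {
            id<MTLCommandBuffer> cb = cbs[iter];

            const int nodes_per_cb = (n_nodes + n_cb - 1) / n_cb;
            const int node_start   = (int) iter * nodes_per_cb;
            const int node_end     = node_start + nodes_per_cb < n_nodes
                                         ? node_start + nodes_per_cb : n_nodes;

            id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
            for (int i = node_start; i < node_end; ++i) {
                // per-node pipeline setup and dispatch would go here
            }
            [enc endEncoding];

            [cb commit]; // earlier chunks can start on the GPU while later ones are still encoding
        });

        // Buffers on one queue execute in enqueue order, so waiting on the last
        // one means all of them have finished.
        [cbs.lastObject waitUntilCompleted];
    }
    return 0;
}

With a single command buffer, all of the encoding happens on one thread before anything is committed; splitting the work across n_cb buffers lets the CPU-side encoding run on several cores and lets the GPU start executing the first chunk earlier.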
When I experimented with different ctx->n_cb settings for different LLMs on different Mac machines, I found very little performance difference between them, e.g. n_cb = 1 was always as efficient as the default n_cb = 64.
There is a noticeable difference when inference is running while competing with other GPU-heavy workloads, but I suspect that was not the primary intent of implementing multiple command buffer support.
What type of workload am I missing that demonstrates a performance advantage of having multiple Metal command buffers?