Replies: 1 comment
-
I think encoding the command buffers in parallel was recommended in some documentation (#1860), but I don't keep a reference.
-
In ggml-metal.m, there is code that submits several Metal work chunks in parallel:
const int n_cb = ctx->n_cb;
..
for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) { // create multiple command buffers
..
dispatch_apply(n_cb, ctx->d_queue, ^(size_t iter) { // run multiple command buffers
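For reference, here is a minimal, self-contained sketch of the same enqueue-then-encode-in-parallel pattern (not the actual ggml-metal.m code; n_nodes, the chunk split, and the empty encoder body are placeholders):

#import <Metal/Metal.h>
#import <dispatch/dispatch.h>

int main(void) {
    @autoreleasepool {
        id<MTLDevice>       device = MTLCreateSystemDefaultDevice();
        id<MTLCommandQueue> queue  = [device newCommandQueue];

        const int n_cb    = 8;     // number of command buffers (work chunks)
        const int n_nodes = 1024;  // placeholder for the total work to encode

        dispatch_queue_t d_queue =
            dispatch_queue_create("example.encode", DISPATCH_QUEUE_CONCURRENT);

        // Create and enqueue the command buffers up front, so the GPU executes
        // them in chunk order no matter which thread finishes encoding first.
        NSMutableArray<id<MTLCommandBuffer>> *cbs = [NSMutableArray arrayWithCapacity:n_cb];
        for (int cb_idx = 0; cb_idx < n_cb; ++cb_idx) {
            id<MTLCommandBuffer> cb = [queue commandBuffer];
            [cb enqueue];
            [cbs addObject:cb];
        }

        // Encode the chunks in parallel: each worker encodes its own slice into
        // its own command buffer and commits it independently.
        dispatch_apply(n_cb, d_queue, ^(size_t iter) {
            id<MTLCommandBuffer> cb = cbs[iter];

            const int nodes_per_cb = (n_nodes + n_cb - 1) / n_cb;
            const int node_start   = (int) iter * nodes_per_cb;
            const int node_end     = node_start + nodes_per_cb < n_nodes
                                         ? node_start + nodes_per_cb : n_nodes;

            id<MTLComputeCommandEncoder> enc = [cb computeCommandEncoder];
            for (int i = node_start; i < node_end; ++i) {
                // per-node pipeline setup and dispatch would go here
            }
            [enc endEncoding];

            [cb commit]; // earlier chunks can start on the GPU while later ones are still encoding
        });

        // Buffers on one queue execute in enqueue order, so waiting on the last
        // one means all of them have finished.
        [cbs.lastObject waitUntilCompleted];
    }
    return 0;
}

With a single command buffer, all of the encoding happens on one thread before anything is committed; splitting the work across n_cb buffers lets the CPU-side encoding run on several cores and lets the GPU start executing the first chunk earlier.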
When I experimented with different ctx->n_cb settings for different LLMs on different Mac machines, I found very little performance difference between them, e.g. n_cb = 1 was always as efficient as the default n_cb = 64.
There is a noticeable difference when inference is running while competing with other GPU-heavy workloads, but I suspect that was not the primary intent of implementing multiple command buffer support.
What type of workload am I missing that demonstrates a performance advantage of having multiple Metal command buffers?