Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Speed up image encode with Metal.
Motivation
I want to run a multimodal vision model on a Mac M2 and an NVIDIA 4090 with
./build/bin/llama-mtmd-cli
Currently, encoding a single image slice is slow: one slice takes 5000+ ms on Metal, while on CUDA each slice needs only 170+ ms. It looks like the Metal GPU is not being used effectively.
Detailed traces are below:
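For reference, a minimal repro sketch of the kind of invocation used above. The model, mmproj, and image paths are placeholders (adjust to your setup); the flags assume a standard llama.cpp build of `llama-mtmd-cli`:

```shell
#!/usr/bin/env sh
# Hypothetical repro script; file names below are placeholders.
BIN=./build/bin/llama-mtmd-cli
MODEL=Model-7.6B-Q4_K_M.gguf      # LLM weights (Q4_K_M quant)
MMPROJ=mmproj-model-f16.gguf      # vision encoder / projector
IMG=test.jpg                      # input image to be sliced and encoded

if [ -x "$BIN" ]; then
    # The per-slice "image slice encoded in ... ms" timings are printed by the CLI itself.
    "$BIN" -m "$MODEL" --mmproj "$MMPROJ" --image "$IMG" -p "Describe this image."
else
    echo "llama-mtmd-cli not found; build llama.cpp first"
fi
```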
======= Metal =============
clip_model_loader: model name:
clip_model_loader: description: image encoder for MiniCPM-V
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 455
clip_model_loader: n_kv: 19
clip_model_loader: has vision encoder
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
clip_ctx: CLIP using Metal backend
load_hparams: projector: resampler
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 0
--- vision hparams ---
load_hparams: image_size: 448
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 4
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 996.02 MiB
load_hparams: metadata size: 0.16 MiB
alloc_compute_meta: Metal compute buffer size = 98.30 MiB
alloc_compute_meta: CPU compute buffer size = 16.30 MiB
main: loading model: /Users/a0/Downloads/models/MiniCPM-o-2_6-gguf/Model-7.6B-Q4_K_M.gguf
encoding image slice...
image slice encoded in 5446 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 394 ms
========== CUDA =============
clip_model_loader: model name:
clip_model_loader: description: image encoder for MiniCPM-V
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 455
clip_model_loader: n_kv: 19
clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: resampler
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 26
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 0
--- vision hparams ---
load_hparams: image_size: 448
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 3
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 996.02 MiB
load_hparams: metadata size: 0.16 MiB
alloc_compute_meta: CUDA0 compute buffer size = 98.30 MiB
alloc_compute_meta: CPU compute buffer size = 16.30 MiB
main: loading model: /cache/zhanglei/models/MiniCPM-V-2_6/gguf/ggml-model-Q4_K_M.gguf
encoding image slice...
image slice encoded in 121 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 68 ms
Possible Implementation
No idea.