Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Speed up image encode with Metal.
Motivation
I want to run a multimodal vision model on a Mac M2 and an NVIDIA 4090 with
./build/bin/llama-mtmd-cli
Currently, encoding a single image slice is slow: one slice takes 5000+ ms on Metal, while on CUDA each slice needs only 170+ ms. It looks like the Metal GPU is not being used effectively.
Detailed traces are below:
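For reference, a minimal repro sketch of the kind of invocation used above. The model, mmproj, and image paths are placeholders (adjust to your setup); the flags assume a standard llama.cpp build of `llama-mtmd-cli`:

```shell
#!/usr/bin/env sh
# Hypothetical repro script; file names below are placeholders.
BIN=./build/bin/llama-mtmd-cli
MODEL=Model-7.6B-Q4_K_M.gguf      # LLM weights (Q4_K_M quant)
MMPROJ=mmproj-model-f16.gguf      # vision encoder / projector
IMG=test.jpg                      # input image to be sliced and encoded

if [ -x "$BIN" ]; then
    # The per-slice "image slice encoded in ... ms" timings are printed by the CLI itself.
    "$BIN" -m "$MODEL" --mmproj "$MMPROJ" --image "$IMG" -p "Describe this image."
else
    echo "llama-mtmd-cli not found; build llama.cpp first"
fi
```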
======= Metal =============
clip_model_loader: model name:
clip_model_loader: description: image encoder for MiniCPM-V
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 455
clip_model_loader: n_kv: 19
clip_model_loader: has vision encoder
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: GPU name: Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
clip_ctx: CLIP using Metal backend
load_hparams: projector: resampler
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 0
--- vision hparams ---
load_hparams: image_size: 448
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 4
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 996.02 MiB
load_hparams: metadata size: 0.16 MiB
alloc_compute_meta: Metal compute buffer size = 98.30 MiB
alloc_compute_meta: CPU compute buffer size = 16.30 MiB
main: loading model: /Users/a0/Downloads/models/MiniCPM-o-2_6-gguf/Model-7.6B-Q4_K_M.gguf
encoding image slice...
image slice encoded in 5446 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 394 ms
========== CUDA =============
clip_model_loader: model name:
clip_model_loader: description: image encoder for MiniCPM-V
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 455
clip_model_loader: n_kv: 19
clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: resampler
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 26
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 0
--- vision hparams ---
load_hparams: image_size: 448
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 3
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 996.02 MiB
load_hparams: metadata size: 0.16 MiB
alloc_compute_meta: CUDA0 compute buffer size = 98.30 MiB
alloc_compute_meta: CPU compute buffer size = 16.30 MiB
main: loading model: /cache/zhanglei/models/MiniCPM-V-2_6/gguf/ggml-model-Q4_K_M.gguf
encoding image slice...
image slice encoded in 121 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 68 ms
Possible Implementation
No idea.