Metal performance when using llama-cpp-python wrapper tanked from 0.2.28 to 0.2.29 (I know this is a llama.cpp core discussion.. but..) #6113

agnosticlines · 2024-03-17T15:11:39Z


Hey all,

I know this is the llama.cpp core project and that this issue is about the Python wrapper, which isn't maintained by the same developers, but I've been looking into it for a few days and it has stumped me. Could anyone here who knows the llama.cpp codebase think of why performance would tank on Metal between llama.cpp versions 6efb8eb and 4483396?

There's an issue here too, and I've also done a little bit of digging here. I'm just a bit stumped and would love some smart people to help :)

ggerganov · 2024-03-17T17:43:20Z

Maintainer

Can you confirm that the GPU is used by observing the activity monitor?

6 replies

ggerganov Mar 17, 2024
Maintainer

If it is not 100%, then the n_gpu_layers parameter is probably not being passed correctly. Look into this and make sure a large value (like 99) goes into llama.cpp.

agnosticlines Mar 17, 2024
Author

It definitely does get passed. I looked into that, and the llama.cpp debug output on run shows it:

0.2.29:

llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 30.94 GiB (5.69 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.76 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 31600.27 MiB, (31600.33 / 60000.00)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    85.94 MiB
llm_load_tensors:      Metal buffer size = 31600.25 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/m1/python/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 62914.56 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  4096.00 MiB, (35697.89 / 60000.00)
llama_kv_cache_init:      Metal KV buffer size =  4096.00 MiB
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (35697.91 / 60000.00)
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  2164.05 MiB, (37861.94 / 60000.00)
llama_new_context_with_model: graph splits (measure): 3
llama_new_context_with_model:      Metal compute buffer size =  2164.03 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.00 MiB
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
17:38:25-343799 INFO     LOADER: "llamacpp_HF"
17:38:25-344328 INFO     TRUNCATION LENGTH: 32768
17:38:25-344664 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
17:38:25-345103 INFO     Loaded the model in 0.49 seconds.

0.2.28:

llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 30.94 GiB (5.69 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.38 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 31686.94 MiB, (31687.00 / 60000.00)
llm_load_tensors: system memory used  = 31686.56 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/m1/python/python3.11/site-packages/llama_cpp/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 62914.56 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  4096.00 MiB, (35784.56 / 60000.00)
llama_new_context_with_model: KV self size  = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =     0.02 MiB, (35784.58 / 60000.00)
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 2167.22 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =  2164.05 MiB, (37948.61 / 60000.00)
AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |

The only thing I see in the newer output that I don't see in the older version's is:

llama_new_context_with_model: graph splits (measure): 3
llama_new_context_with_model:      Metal compute buffer size =  2164.03 MiB
llama_new_context_with_model:        CPU compute buffer size =    72.00 MiB
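
For anyone who wants to compare two load logs like these mechanically rather than by eye, a small script can reduce each line to its field name and list the fields that appear in only one log. This is a minimal sketch, not part of llama.cpp or llama-cpp-python: the helper names `log_keys` and `only_in_new` are hypothetical, and the normalization rules (cut at `=`, `,` or `(`, replace numbers with `N`) are assumptions chosen to fit the log format pasted in this thread.

```python
import re


def log_keys(log_text):
    """Reduce each 'function: field = value' log line to a 'function: field' key."""
    keys = set()
    for line in log_text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue  # skip progress dots, blank lines, the CPU feature line, etc.
        func, rest = line.split(":", 1)
        # Keep the field name only: drop everything from the first '=', ',' or '('.
        field = re.split(r"=|,|\(", rest, maxsplit=1)[0]
        # Replace numbers with N so that differing values do not count as new fields.
        field = re.sub(r"\d+(\.\d+)?", "N", field)
        field = re.sub(r"\s+", " ", field).strip()
        keys.add(f"{func.strip()}: {field}")
    return keys


def only_in_new(old_log, new_log):
    """Return the sorted keys that occur in new_log but not in old_log."""
    return sorted(log_keys(new_log) - log_keys(old_log))


old_log = """\
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: compute buffer total size = 2167.22 MiB
"""
new_log = """\
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: graph splits (measure): 3
llama_new_context_with_model:      Metal compute buffer size =  2164.03 MiB
"""

for key in only_in_new(old_log, new_log):
    print(key)
```

Fed the two full logs above, a diff of this kind would also surface lines such as `offloading 32 repeating layers to GPU`, the `simdgroup ... support` entries, and `Metal KV buffer size`, which appear only in the 0.2.29 output.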

agnosticlines Mar 20, 2024
Author

@ggerganov I had a look and produced some flame graphs. I know you're super busy, but would you mind having a quick look and seeing if you spot anything obviously out of the ordinary? GPU usage does go up to 100%, but that seems to be consistent with the older versions too. I'm just a little concerned that people are staying on an older version that's affected by lots of reported bugs, and we're also missing out on the awesome Cohere stuff :)

Graphs are here: abetlen/llama-cpp-python#1117 (comment)

It can't be a core bug in llama.cpp, because people are successfully running the latest llama.cpp on Metal with the same performance as the older llama-cpp-python version. So it's gotta be something in either the build (https://github.com/abetlen/llama-cpp-python/blob/main/CMakeLists.txt) or the invocation of llama.cpp, right?

I sent you some coffee on buymeacoffee btw. Even if you don't manage to take a look at it, thanks for all your fantastic work.

ggerganov Mar 20, 2024
Maintainer

I just installed llama-cpp-python and I don't see a slowdown:

python3 -m venv x
source x/bin/activate

CMAKE_ARGS="-DLLAMA_METAL_EMBED_LIBRARY=ON -DLLAMA_METAL=on" pip install -U llama-cpp-python --no-cache-dir

from llama_cpp import Llama

llm = Llama(
    model_path="../llama.cpp/models/llama-7b-v2/ggml-model-f16.gguf",
    n_gpu_layers=99,  # Offload all layers to the GPU
    # seed=1337,  # Uncomment to set a specific seed
    # n_ctx=2048,  # Uncomment to increase the context window
)

# Warm-up run, so one-time setup cost is not included in the timings below
output = llm("dummy", max_tokens=32)

output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=32,  # Generate up to 32 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True,  # Echo the prompt back in the output
)  # Generate a completion, can also call create_completion

print(output)

python3 test.py

llama_print_timings:        load time =     421.62 ms
llama_print_timings:      sample time =       3.03 ms /    32 runs   (    0.09 ms per token, 10571.52 tokens per second)
llama_print_timings: prompt eval time =      54.15 ms /    14 tokens (    3.87 ms per token,   258.56 tokens per second)
llama_print_timings:        eval time =     735.06 ms /    31 runs   (   23.71 ms per token,    42.17 tokens per second)
llama_print_timings:       total time =     827.17 ms /    45 tokens

Using llama.cpp with main:

./main -m models/llama-7b-v2/ggml-model-f16.gguf -p "Q: Name the planets in the solar system? A:" -n 64

llama_print_timings:        load time =     506.91 ms
llama_print_timings:      sample time =       4.80 ms /    64 runs   (    0.07 ms per token, 13344.45 tokens per second)
llama_print_timings: prompt eval time =      54.10 ms /    14 tokens (    3.86 ms per token,   258.80 tokens per second)
llama_print_timings:        eval time =    1524.06 ms /    63 runs   (   24.19 ms per token,    41.34 tokens per second)
llama_print_timings:       total time =    1588.54 ms /    77 tokens
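
To compare runs like the two above without reading the numbers by hand, the `llama_print_timings` lines can be parsed and the throughput recomputed from the raw milliseconds and counts. The following is a minimal sketch; `TIMING_RE` and `parse_timings` are hypothetical names, and the regular expression assumes the exact line layout shown in this thread, which may differ in other llama.cpp versions.

```python
import re

# Matches e.g. "llama_print_timings: prompt eval time = 54.15 ms / 14 tokens (...)".
# The "/ N runs|tokens" part is optional because the load-time line has no count.
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log_text):
    """Map each timing name ('load', 'prompt eval', 'eval', ...) to its numbers."""
    timings = {}
    for match in TIMING_RE.finditer(log_text):
        ms = float(match.group("ms"))
        count = int(match.group("count")) if match.group("count") else None
        # Tokens per second, recomputed from the rounded values in the log.
        rate = count / ms * 1000.0 if count else None
        timings[match.group("name")] = {"ms": ms, "count": count, "tokens_per_s": rate}
    return timings


wrapper_log = """\
llama_print_timings:        load time =     421.62 ms
llama_print_timings: prompt eval time =      54.15 ms /    14 tokens (    3.87 ms per token,   258.56 tokens per second)
llama_print_timings:        eval time =     735.06 ms /    31 runs   (   23.71 ms per token,    42.17 tokens per second)
"""

for name, values in parse_timings(wrapper_log).items():
    print(name, values)
```

Because the log prints rounded milliseconds, the recomputed rates can differ from the printed ones in the second decimal place. With only 14 prompt tokens, the `prompt eval` figure says little about how prompt processing scales with longer inputs.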

agnosticlines Mar 20, 2024
Author

The issue only seems to occur when the prompt is much larger. With a short prompt it's about as quick for me too, but with something like 6k-10k tokens it's a slog.
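
One way to check whether the slowdown grows with prompt length is to time the same completion call on progressively longer prompts under each wrapper version. The sketch below is an illustration only: `measure_prompt_scaling` and `make_llama_runner` are hypothetical helpers, the filler text and sizes are arbitrary, and the model path is whatever GGUF file you have locally. The unique prefix on each prompt is an assumption-driven precaution, intended to keep any prompt-prefix reuse in the wrapper from shortening later runs.

```python
import time


def measure_prompt_scaling(complete, sizes, filler="The quick brown fox jumps. "):
    """Time complete(prompt) for prompts built from `n` repeats of `filler`.

    `complete` is any callable that takes a prompt string.
    Returns a list of (repeats, prompt_chars, seconds) tuples.
    """
    results = []
    for n in sizes:
        # Distinct leading text per run so prompts do not share a common prefix.
        prompt = f"[run {n}] " + filler * n
        start = time.perf_counter()
        complete(prompt)
        elapsed = time.perf_counter() - start
        results.append((n, len(prompt), elapsed))
    return results


def make_llama_runner(model_path, n_ctx=16384):
    """Build a callable around llama-cpp-python (imported lazily, only if used)."""
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_gpu_layers=99, n_ctx=n_ctx, verbose=True)
    # max_tokens=1 keeps generation short, so prompt processing dominates the time.
    return lambda prompt: llm(prompt, max_tokens=1)


# Example with a stand-in callable; swap in make_llama_runner("<your model>.gguf")
# to time a real model.
for repeats, chars, seconds in measure_prompt_scaling(len, [10, 100, 1000]):
    print(f"{repeats:5d} repeats, {chars:6d} chars: {seconds:.6f} s")
```

Running the same script in two virtual environments, one with 0.2.28 and one with 0.2.29 installed, would show whether the gap widens as the prompt grows. Character counts are only a rough proxy for token counts.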
