-
It sounds like there might be something wrong with how you are compiling the program. I don't think it should be possible for the simple.cpp program to access ctx->model.hparams at all, since llama_context is only forward-declared in llama.h. If I add that printf to simple.cpp and build the example, it fails to compile:

$ cmake --build build --target llama-simple
[ 54%] Built target ggml
[ 81%] Built target llama
[ 90%] Building CXX object examples/simple/CMakeFiles/llama-simple.dir/simple.cpp.o
/Users/danbev/work/llama.cpp/examples/simple/simple.cpp:151:74: error: member access into incomplete type 'llama_context'
printf("%s n_embd_head_v just before llama decode = %u\n", __func__, ctx->model.hparams.n_embd_head_v);
^
/Users/danbev/work/llama.cpp/src/../include/llama.h:61:12: note: forward declaration of 'llama_context'
struct llama_context;
^
1 error generated.
make[3]: *** [examples/simple/CMakeFiles/llama-simple.dir/simple.cpp.o] Error 1
make[2]: *** [examples/simple/CMakeFiles/llama-simple.dir/all] Error 2
make[1]: *** [examples/simple/CMakeFiles/llama-simple.dir/rule] Error 2
make: *** [llama-simple] Error 2

Without that printf (it is commented out in the source listing below), the example builds, and the value can be inspected in lldb instead:

$ lldb build/bin/llama-simple -- -m models/llama-2-7b.Q4_K_M.gguf -p "What is LoRA?"
(lldb) target create "build/bin/llama-simple"
Current executable set to '/Users/danbev/work/llama.cpp/build/bin/llama-simple' (arm64).
(lldb) settings set -- target.run-args "-m" "models/llama-2-7b.Q4_K_M.gguf" "-p" "What is LoRA?"
(lldb) br set -f simple.cpp -l 151
Breakpoint 1: where = llama-simple`main + 2364 at simple.cpp:152:26, address = 0x0000000100005ccc
(lldb) r
Process 90708 launched: '/Users/danbev/work/llama.cpp/build/bin/llama-simple' (arm64)
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3)
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.27 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 3577.56 MiB, ( 3577.64 / 16384.02)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 3577.56 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 64
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3
ggml_metal_init: picking default device: Apple M3
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M3
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
llama_kv_cache_init: Metal KV buffer size = 32.00 MiB
llama_new_context_with_model: KV self size = 32.00 MiB, K (f16): 16.00 MiB, V (f16): 16.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0.00 MiB to 4.41 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.51 MiB
llama_new_context_with_model: Metal compute buffer size = 4.41 MiB
llama_new_context_with_model: CPU compute buffer size = 0.51 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
Process 90708 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000100005ccc llama-simple`main(argc=5, argv=0x000000016fdff298) at simple.cpp:152:26
149 for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict; ) {
150 // evaluate the current batch with the transformer model
151 //printf("%s n_embd_head_v just before llama decode = %u\n", __func__, ctx->model.hparams.n_embd_head_v);
-> 152 if (llama_decode(ctx, batch)) {
153 fprintf(stderr, "%s : failed to eval, return code %d\n", __func__, 1);
154 return 1;
155 }
Target 0: (llama-simple) stopped.
(lldb) p ctx->model->hparams->n_embd_head_v
(const uint32_t) 128

Perhaps double checking the compilation commands that are being used can help sort this out.
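For what it's worth, here is a minimal sketch of how I would read model/context properties from user code such as simple.cpp without reaching into the internal structs. It assumes only the public accessors declared in llama.h (there is no public getter for n_embd_head_v specifically, so the fields below are just illustrative):

```cpp
#include "llama.h"
#include <cstdio>

// llama_context and llama_model are only forward-declared in llama.h, so user
// code cannot dereference ctx->model directly; properties go through accessors.
static void print_model_info(struct llama_context * ctx) {
    const struct llama_model * model = llama_get_model(ctx);
    printf("n_ctx   = %u\n", llama_n_ctx(ctx));
    printf("n_embd  = %d\n", llama_n_embd(model));
    printf("n_layer = %d\n", llama_n_layer(model));
    printf("n_vocab = %d\n", llama_n_vocab(model));
}
```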
-
I've used the same compilation command as you did but also added debug symbols (-g). First we can inspect the value of n_embd_head_v in simple.cpp, just before the call to llama_decode:

(gdb) f
#0 main (argc=7, argv=0x7fffffffdb38) at simple.cpp:154
154 if (llama_decode(ctx, batch)) {
(gdb) p ctx->model->hparams->n_embd_head_v
$1 = 128
(gdb) p &ctx->model->hparams
$5 = (llama_hparams *) 0x55555592b5f0

And then in llama.cpp, stopped inside llama_decode at a printf I added there for debugging:

llama_decode (ctx=0x55555595ccc0, batch=...) at /home/danbev/work/ai/llama.cpp/src/llama.cpp:21234
21234 printf("%s: What is the value of n_embd_head_v = %u\n", __func__, ctx->model.hparams.n_embd_head_v);
(gdb) p fflush(stdout)
<s> -p What is LoRA?$8 = 0
(gdb) n
llama_decode: What is the value of n_embd_head_v = 128
(gdb) p &ctx->model->hparams
$6 = (llama_hparams *) 0x55555592b5f0

So I'm not able to reproduce your original issue. Both frames report the same llama_hparams address and the same value (128), so this seems to work as expected; perhaps there is an environment issue causing this.
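For reference, the printf visible in the gdb session above is a local debugging patch I added near the top of llama_decode in src/llama.cpp (it is not part of the upstream code):

```cpp
// Local debugging patch at the top of llama_decode() in src/llama.cpp (not
// upstream): print the value that the library itself sees for the same field.
// Inside the library llama_context is a complete type, so this member access
// compiles here even though it does not from simple.cpp.
printf("%s: What is the value of n_embd_head_v = %u\n",
       __func__, ctx->model.hparams.n_embd_head_v);
```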
-
I've been using the simple.cpp example from the llama.cpp repository on GitHub to gain insight into the library's internal workings. I'm compiling the code with g++ and dynamically linking against libllama.so and libggml.so. However, I've observed an anomaly in the value of n_embd_head_v immediately before calling llama_decode(ctx, batch).

Specifically, I've noticed that the value of ctx->model.hparams.n_embd_head_v is inconsistent. Just before entering the llama_decode function, its value is 0, but upon inspecting the same value within the llama_decode function, it magically changes to 128. I'm struggling to understand why this value is being transformed in this way.

Inside llama.cpp: (screenshot)

Inside simple.cpp: (screenshot)
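For context, the check in simple.cpp looks roughly like this (simplified; it is the upstream simple.cpp with one printf added just before the decode call):

```cpp
// Added just before the llama_decode() call in the generation loop of simple.cpp.
// Note: reading ctx->model.hparams requires a complete definition of
// llama_context, which llama.h only forward-declares.
printf("%s n_embd_head_v just before llama decode = %u\n",
       __func__, ctx->model.hparams.n_embd_head_v);

if (llama_decode(ctx, batch)) {
    fprintf(stderr, "%s : failed to eval, return code %d\n", __func__, 1);
    return 1;
}
```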
Furthermore, I am facing some issues with dynamically linking against internal functions of llama.cpp that are not declared in llama.h and therefore not readily accessible.
Please help me figure this out.
Thank you in advance for your time and assistance!