-
It sounds like there might be something wrong with how you are compiling the program. I don't think it should be possible for the simple.cpp program to access ctx->model.hparams at all, since llama_context is only forward-declared in llama.h. If I add that printf to simple.cpp and build the example, it fails to compile:

$ cmake --build build --target llama-simple
[ 54%] Built target ggml
[ 81%] Built target llama
[ 90%] Building CXX object examples/simple/CMakeFiles/llama-simple.dir/simple.cpp.o
/Users/danbev/work/llama.cpp/examples/simple/simple.cpp:151:74: error: member access into incomplete type 'llama_context'
printf("%s n_embd_head_v just before llama decode = %u\n", __func__, ctx->model.hparams.n_embd_head_v);
^
/Users/danbev/work/llama.cpp/src/../include/llama.h:61:12: note: forward declaration of 'llama_context'
struct llama_context;
^
1 error generated.
make[3]: *** [examples/simple/CMakeFiles/llama-simple.dir/simple.cpp.o] Error 1
make[2]: *** [examples/simple/CMakeFiles/llama-simple.dir/all] Error 2
make[1]: *** [examples/simple/CMakeFiles/llama-simple.dir/rule] Error 2
make: *** [llama-simple] Error 2

Without that printf (it is commented out in the source listing below), the example builds, and the value can be inspected in lldb instead:

$ lldb build/bin/llama-simple -- -m models/llama-2-7b.Q4_K_M.gguf -p "What is LoRA?"
(lldb) target create "build/bin/llama-simple"
Current executable set to '/Users/danbev/work/llama.cpp/build/bin/llama-simple' (arm64).
(lldb) settings set -- target.run-args "-m" "models/llama-2-7b.Q4_K_M.gguf" "-p" "What is LoRA?"
(lldb) br set -f simple.cpp -l 151
Breakpoint 1: where = llama-simple`main + 2364 at simple.cpp:152:26, address = 0x0000000100005ccc
(lldb) r
Process 90708 launched: '/Users/danbev/work/llama.cpp/build/bin/llama-simple' (arm64)
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3)
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from models/llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOG token = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size = 0.27 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 3577.56 MiB, ( 3577.64 / 16384.02)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 3577.56 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 64
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3
ggml_metal_init: picking default device: Apple M3
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name: Apple M3
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 17179.89 MB
llama_kv_cache_init: Metal KV buffer size = 32.00 MiB
llama_new_context_with_model: KV self size = 32.00 MiB, K (f16): 16.00 MiB, V (f16): 16.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
ggml_gallocr_reserve_n: reallocating Metal buffer from size 0.00 MiB to 4.41 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 0.51 MiB
llama_new_context_with_model: Metal compute buffer size = 4.41 MiB
llama_new_context_with_model: CPU compute buffer size = 0.51 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
Process 90708 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000100005ccc llama-simple`main(argc=5, argv=0x000000016fdff298) at simple.cpp:152:26
149 for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict; ) {
150 // evaluate the current batch with the transformer model
151 //printf("%s n_embd_head_v just before llama decode = %u\n", __func__, ctx->model.hparams.n_embd_head_v);
-> 152 if (llama_decode(ctx, batch)) {
153 fprintf(stderr, "%s : failed to eval, return code %d\n", __func__, 1);
154 return 1;
155 }
Target 0: (llama-simple) stopped.
(lldb) p ctx->model->hparams->n_embd_head_v
(const uint32_t) 128

Perhaps double checking the compilation commands that are being used can help sort this out.
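For what it's worth, here is a minimal sketch of how I would read model/context properties from user code such as simple.cpp without reaching into the internal structs. It assumes only the public accessors declared in llama.h (there is no public getter for n_embd_head_v specifically, so the fields below are just illustrative):

```cpp
#include "llama.h"
#include <cstdio>

// llama_context and llama_model are only forward-declared in llama.h, so user
// code cannot dereference ctx->model directly; properties go through accessors.
static void print_model_info(struct llama_context * ctx) {
    const struct llama_model * model = llama_get_model(ctx);
    printf("n_ctx   = %u\n", llama_n_ctx(ctx));
    printf("n_embd  = %d\n", llama_n_embd(model));
    printf("n_layer = %d\n", llama_n_layer(model));
    printf("n_vocab = %d\n", llama_n_vocab(model));
}
```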
-
I've used the same compilation command as you did but also added debug symbols (-g). First we can inspect the value of n_embd_head_v in simple.cpp, just before the call to llama_decode:

(gdb) f
#0 main (argc=7, argv=0x7fffffffdb38) at simple.cpp:154
154 if (llama_decode(ctx, batch)) {
(gdb) p ctx->model->hparams->n_embd_head_v
$1 = 128
(gdb) p &ctx->model->hparams
$5 = (llama_hparams *) 0x55555592b5f0

And then in llama.cpp, stopped inside llama_decode at a printf I added there for debugging:

llama_decode (ctx=0x55555595ccc0, batch=...) at /home/danbev/work/ai/llama.cpp/src/llama.cpp:21234
21234 printf("%s: What is the value of n_embd_head_v = %u\n", __func__, ctx->model.hparams.n_embd_head_v);
(gdb) p fflush(stdout)
<s> -p What is LoRA?$8 = 0
(gdb) n
llama_decode: What is the value of n_embd_head_v = 128
(gdb) p &ctx->model->hparams
$6 = (llama_hparams *) 0x55555592b5f0

So I'm not able to reproduce your original issue. Both frames report the same llama_hparams address and the same value (128), so this seems to work as expected; perhaps there is an environment issue causing this.
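For reference, the printf visible in the gdb session above is a local debugging patch I added near the top of llama_decode in src/llama.cpp (it is not part of the upstream code):

```cpp
// Local debugging patch at the top of llama_decode() in src/llama.cpp (not
// upstream): print the value that the library itself sees for the same field.
// Inside the library llama_context is a complete type, so this member access
// compiles here even though it does not from simple.cpp.
printf("%s: What is the value of n_embd_head_v = %u\n",
       __func__, ctx->model.hparams.n_embd_head_v);
```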
-
I've been using the simple.cpp example from the llama.cpp repository on GitHub to gain insight into the library's internal workings. I'm compiling the code with g++ and dynamically linking against libllama.so and libggml.so. However, I've observed an anomaly in the value of n_embd_head_v immediately before calling llama_decode(ctx, batch).

Specifically, I've noticed that the value of ctx->model.hparams.n_embd_head_v is inconsistent. Just before entering the llama_decode function, its value is 0, but upon inspecting the same value within the llama_decode function, it magically changes to 128. I'm struggling to understand why this value is being transformed in this way.

Inside llama.cpp: (screenshot)

Inside simple.cpp: (screenshot)
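For context, the check in simple.cpp looks roughly like this (simplified; it is the upstream simple.cpp with one printf added just before the decode call):

```cpp
// Added just before the llama_decode() call in the generation loop of simple.cpp.
// Note: reading ctx->model.hparams requires a complete definition of
// llama_context, which llama.h only forward-declares.
printf("%s n_embd_head_v just before llama decode = %u\n",
       __func__, ctx->model.hparams.n_embd_head_v);

if (llama_decode(ctx, batch)) {
    fprintf(stderr, "%s : failed to eval, return code %d\n", __func__, 1);
    return 1;
}
```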
Furthermore, I am facing some issues with dynamically linking against internal functions of llama.cpp that are not declared in llama.h and therefore not readily accessible.
Please help me figure this out.
Thank you in advance for your time and assistance!