Does finetune support phi-2 #4995
Replies: 3 comments
-
[1705465454] Log start
-
I have the same error.
-
From the README in ggerganov/llama.cpp's examples/finetune/, I think this time "llama" is literal and only applies to the LLaMA models from Facebook. Today I looked into modifying the code and, holy lord, I am out of my depth. Have you looked at the ggml.h and finetune.cpp files?
-
I recently ran a finetune on a mistral model and everything seems great.
However, when I run the same finetune on phi-2, I get the following log when running a test prompt:
<main.log added as comment>
The relevant part is mainly these two lines:
[1705465495] llama_apply_lora_from_file_internal: incompatible tensor dimensions (2560 and 4096); are you sure that this adapter is for this model?
[1705465496] main: error: unable to load model
Is there something I have to set to make the tensor dimensions match? When I run the same test with the mistral LoRA, I only get this warning:
llama_apply_lora_from_file_internal: warning: using a lora adapter with a quantized model may result in poor quality, use a f16 or f32 base model with --lora-base
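For context, the mismatch in the first error is between phi-2's embedding width (2560) and the 4096-wide tensors the adapter carries (the llama/mistral-7B hidden size), which is why the loader suspects the adapter belongs to a different model. Below is a minimal standalone sketch of the kind of shape check the LoRA apply step performs before computing W + scale * (B * A); it is not the actual llama.cpp source, and the tensor shapes and the exact comparison are assumptions made for illustration:

#include <stdio.h>
#include <stdbool.h>

/* Simplified stand-in for ggml's 4-D shape array. */
struct tensor_shape { long long ne[4]; };

/* Illustrative check: the adapter's input width must match the base weight's
 * row width before the delta can be merged; disagreeing shapes produce the
 * "incompatible tensor dimensions" error. */
static bool lora_shapes_compatible(struct tensor_shape base_w, struct tensor_shape lora_a) {
    return base_w.ne[0] == lora_a.ne[1];
}

int main(void) {
    struct tensor_shape phi2_weight    = { {2560, 2560, 1, 1} }; /* phi-2 projection weight (2560 wide) */
    struct tensor_shape adapter_lora_a = { {   8, 4096, 1, 1} }; /* rank-8 A matrix, 4096 wide          */

    if (!lora_shapes_compatible(phi2_weight, adapter_lora_a)) {
        printf("incompatible tensor dimensions (%lld and %lld); "
               "are you sure that this adapter is for this model?\n",
               phi2_weight.ne[0], adapter_lora_a.ne[1]);
    }
    return 0;
}

Running this prints the same (2560 and 4096) pair seen in the log above; nothing in main can be "set" to reconcile them, since the widths are properties of the model and the adapter respectively.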
Also note that when I try to merge them with export-lora, I get an assertion failure:
llama.cpp\export-lora.exe --model-base models\phi-2.Q5_K_M.gguf --model-out models\phi-2-ecsql.Q5_K_M.gguf --lora-scaled lora\phi2\ggml-lora-150-f32.gguf 1.0
Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
GGML_ASSERT: F:\development\GPT\llama.cpp\ggml.c:3190: ggml_can_repeat(b, a)
phi2.main.log
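The GGML_ASSERT from export-lora points at the same underlying problem. ggml only allows one tensor to be broadcast (repeated) over another if each of its dimensions evenly divides the matching dimension of the other, and a 4096-wide LoRA delta cannot tile phi-2's 2560-wide weights. The following standalone sketch paraphrases that rule; ggml_can_repeat is the real function named in the assert, but the shapes used here and the export-lora internals are assumptions:

#include <stdio.h>
#include <stdbool.h>

struct shape { long long ne[4]; };

/* Paraphrase of ggml's broadcast rule: 'b' can only be repeated across 'a'
 * if every dimension of 'b' divides the matching dimension of 'a'. */
static bool can_repeat(struct shape b, struct shape a) {
    for (int i = 0; i < 4; ++i) {
        if (a.ne[i] % b.ne[i] != 0) {
            return false;
        }
    }
    return true;
}

int main(void) {
    struct shape phi2_weight = { {2560, 2560, 1, 1} }; /* tensor from phi-2.Q5_K_M.gguf           */
    struct shape lora_delta  = { {4096, 4096, 1, 1} }; /* B*A delta shaped for a 4096-wide model  */

    /* Merging needs the delta to broadcast over the base weight; here it
     * cannot, which is the condition the real GGML_ASSERT trips on. */
    printf("can_repeat(delta, base) = %s\n",
           can_repeat(lora_delta, phi2_weight) ? "true" : "false");
    return 0;
}

This prints "false", matching the failed GGML_ASSERT(ggml_can_repeat(b, a)) in ggml.c during the merge.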