iq4_ks performs great on gemma-3-27b-it-qat-q4_0-unquantized #334
-
This is QAT, but unlike previous QAT models I have seen, this one was done with an additional stage of finetuning. That is why I think the raw PPL values are less directly comparable to the version without QAT (and why they are lower, since it was trained longer), but they should still be useful for comparison, given that you can often compare PPL within an architecture family, as in the example below (a bit dated, but I still find this graph made by ikawrakow interesting).
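As a side note (my own addition, not something stated above): the quantization-error metric usually plotted in graphs of that kind is the relative PPL increase of a quant Q over the unquantized baseline, something like $\ln\left(\mathrm{PPL}(Q)/\mathrm{PPL}(\mathrm{base})\right)$, which normalizes out the base model's absolute PPL and is why such comparisons stay meaningful within a model family.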
-
In my quick experiments with Gemma3-12B, the
-
Good question. If the
Is it so because it really is good, or is it more because the sentiment towards Google has shifted lately (at least when it comes to "AI")? My impression is that the Internet believes the latest Gemini models are currently the best (and so, by extension, that Gemma3 must be among the best open-weight models). But for the few things I asked Gemma3-12B about where I have good knowledge of the subject matter, the answers were complete BS.
-
EDIT: My compile script was messed up and putting me into DEBUG mode... I was doing some more benchmarking of various
I believe I'm compiling Release and not Debug, but I'm not completely sure how to tell. I'm not seeing that warning on mainline with roughly the same command (I ran one of the bartowski quants on both mainline and ik_llama.cpp).

👈 Logs

model="/mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf"
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-ctk f16 -ctv f16 \
-c 32768 \
-ngl 99 \
--threads 16
llama_model_loader: loaded meta data with 44 key-value pairs and 808 tensors from /mnt/raid/models/bartowski/google_gemma-3-27b-it-qat-GGUF/google_gemma-3-27b-it-qat-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 3 27b It Qat
llama_model_loader: - kv 3: general.finetune str = it-qat
llama_model_loader: - kv 4: general.basename str = gemma-3
llama_model_loader: - kv 5: general.size_label str = 27B
llama_model_loader: - kv 6: general.license str = gemma
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Gemma 3 27b It
llama_model_loader: - kv 9: general.base_model.0.organization str = Google
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv 11: general.tags arr[str,4] = ["gemma3", "gemma", "google", "image-...
llama_model_loader: - kv 12: gemma3.context_length u32 = 131072
llama_model_loader: - kv 13: gemma3.embedding_length u32 = 5376
llama_model_loader: - kv 14: gemma3.block_count u32 = 62
llama_model_loader: - kv 15: gemma3.feed_forward_length u32 = 21504
llama_model_loader: - kv 16: gemma3.attention.head_count u32 = 32
llama_model_loader: - kv 17: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: gemma3.attention.key_length u32 = 128
llama_model_loader: - kv 19: gemma3.attention.value_length u32 = 128
llama_model_loader: - kv 20: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 22: gemma3.attention.head_count_kv u32 = 16
llama_model_loader: - kv 23: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 24: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 25: tokenizer.ggml.model str = llama
llama_model_loader: - kv 26: tokenizer.ggml.pre str = default
llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 28: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 32: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 37: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 15
llama_model_loader: - kv 40: quantize.imatrix.file str = /models_out/gemma-3-27b-it-qat-GGUF/g...
llama_model_loader: - kv 41: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 434
llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 129
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q4_K: 374 tensors
llama_model_loader: - type q6_K: 61 tensors
llm_load_vocab: special tokens cache size = 6415
llm_load_vocab: token to piece cache size = 1.9446 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gemma3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 262208
llm_load_print_meta: n_merges = 0
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 5376
llm_load_print_meta: n_layer = 62
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 1024
llm_load_print_meta: n_swa_pattern = 6
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 2
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 21504
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 0.125
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 27B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 27.009 B
llm_load_print_meta: model size = 15.404 GiB (4.899 BPW)
llm_load_print_meta: general.name = Gemma 3 27b It Qat
llm_load_print_meta: BOS token = 2 '<bos>'
llm_load_print_meta: EOS token = 1 '<eos>'
llm_load_print_meta: UNK token = 3 '<unk>'
llm_load_print_meta: PAD token = 0 '<pad>'
llm_load_print_meta: LF token = 248 '<0x0A>'
llm_load_print_meta: EOT token = 106 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.70 MiB
llm_load_tensors: offloading 62 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: CPU buffer size = 1102.77 MiB
llm_load_tensors: CUDA0 buffer size = 15773.97 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 0.125
llama_kv_cache_init: CUDA0 KV buffer size = 15872.00 MiB
llama_new_context_with_model: KV self size = 15872.00 MiB, K (f16): 7936.00 MiB, V (f16): 7936.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 522.62 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 138.51 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 522.62 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 138.51 MiB
llama_new_context_with_model: graph nodes = 1806
llama_new_context_with_model: graph splits = 2
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
ggml_backend_cuda_graph_compute: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]
| 512 | 128 | 0 | 0.356 | 1436.25 | 3.719 | 34.42 |
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]
| 512 | 128 | 512 | 0.372 | 1378.12 | 3.782 | 33.85 |
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa_out-0] [5376 512 1 1]
.
.
.

EDIT: Updated graph with
-
I was clearly confused. This is Gemma3.
-
The PP performance difference between mainline and
Are you sure your
-
It is related, but not really the same. With
If you see such messages, you are running in debug mode.
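For anyone unsure how their build was configured: a minimal sketch of forcing a Release build with CMake and of checking an existing build directory (standard CMake usage; the -DGGML_CUDA=ON flag is my assumption for a CUDA build):

# configure and build in Release mode
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
cmake --build build --config Release -j

# check what an existing build directory was configured with
grep CMAKE_BUILD_TYPE build/CMakeCache.txt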
-
@ubergarm Thanks for this iq4_ks quant, it works superbly.

PPL Gemma 27b it q8_0: Final estimate: PPL = 12.8797 +/- 0.12932
-
More. With

llama-perplexity -m E:\text-generation-webui\models\google_gemma-3-4b-it-qat-q4_0-unquantized_CHOSENQUANT.gguf -f wiki.test.raw -fa -mg 0 -ngl 150 -ts 40,0,0 -b 512 --no-mmap -c 512

BF16: PPL = 15.1898 +/- 0.14353

No comment! ^^

Note: I quantized with my fork of Llama.cpp mainline b5588 including the IQ_K quants: https://github.com/Nexesenex/croco.cpp/tree/NXS_Llama.cpp

Reminder:

Edit: for Gemma 3 27b qat q4_0 unquantized, bf16 to pure iq4_xs with ubergarm's imatrix: Final estimate: PPL = 8.2903 +/- 0.06439
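For anyone wanting to reproduce that kind of pure-quant-plus-imatrix run, a minimal sketch of the usual llama-quantize invocation (file names are placeholders, not the exact paths used above):

# quantize a bf16 GGUF to a single quant type everywhere, using an importance matrix
# (--pure disables the usual mixed-type tensor assignments)
./build/bin/llama-quantize \
    --imatrix gemma-3-27b-it-qat.imatrix \
    --pure \
    gemma-3-27b-it-qat-bf16.gguf \
    gemma-3-27b-it-qat-IQ4_XS.gguf \
    IQ4_XS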
-
EDIT: Just uploaded the ik_llama.cpp exclusive quants for best quality in minimum VRAM to huggingface: ubergarm/gemma-3-27b-it-qat-GGUF.

I saw google released their google/gemma-3-27b-it-qat-q4_0-unquantized original .safetensors unquantized model. It is supposedly designed for q4_0 quantization, which was released earlier in gguf format.

I used mainline to convert the .safetensors to bf16 and then used ik_llama.cpp to cook some quants to compare size and perplexity. Here are the results, which interestingly suggest iq4_ks has lower perplexity than the original bf16 (the q8_0 does too)!

Raw Data
Perplexity
Sweep Bench
Methodology
Perplexity
Sweep Bench
Using a single RTX A6000 48GB VRAM GPU
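A rough sketch of the kind of commands this methodology implies (my reconstruction from standard llama.cpp / ik_llama.cpp usage; file names, flags like -ngl/-c, and the calibration text are placeholders, not necessarily the exact settings used):

# convert the unquantized safetensors release to a bf16 GGUF with mainline
python convert_hf_to_gguf.py --outtype bf16 \
    --outfile gemma-3-27b-it-qat-bf16.gguf \
    ./gemma-3-27b-it-qat-q4_0-unquantized/

# cook a quant with ik_llama.cpp
./build/bin/llama-quantize --imatrix gemma-3-27b-it-qat.imatrix \
    gemma-3-27b-it-qat-bf16.gguf gemma-3-27b-it-qat-IQ4_KS.gguf IQ4_KS

# perplexity over wiki.test.raw
./build/bin/llama-perplexity -m gemma-3-27b-it-qat-IQ4_KS.gguf \
    -f wiki.test.raw -fa -ngl 99 --threads 16

# throughput at increasing context depths
./build/bin/llama-sweep-bench -m gemma-3-27b-it-qat-IQ4_KS.gguf \
    -fa -c 32768 -ngl 99 --threads 16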
I tried a number of other mixes based on --layer-similarity scores, trying to optimize the whole-layer score as well as the attn and ffn scores, but in limited testing on this specific model that didn't provide better perplexity. My impression is this QAT was indeed meant to be q4_0, as sometimes using a mix of slightly higher quants for some layers gave slightly worse perplexity.

I didn't compare against non-QAT bf16 quants, but wanted to share some early results with anyone else curious about this QAT business.
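For context, a minimal sketch of how that kind of per-layer information can be gathered and acted on with ik_llama.cpp (I am assuming the --layer-similarity flag mentioned above belongs to llama-imatrix and that per-tensor overrides go through llama-quantize's --custom-q rules; treat the exact flag names and regex syntax as assumptions, not confirmed usage):

# compute an importance matrix and print per-layer similarity scores
./build/bin/llama-imatrix -m gemma-3-27b-it-qat-bf16.gguf \
    -f calibration_data.txt --layer-similarity -o gemma-3-27b-it-qat.imatrix

# example custom mix: bump the ffn_down tensors of a few layers to a larger type
./build/bin/llama-quantize --imatrix gemma-3-27b-it-qat.imatrix \
    --custom-q "blk\.(0|1|2)\.ffn_down=q5_K" \
    gemma-3-27b-it-qat-bf16.gguf gemma-3-27b-it-qat-IQ4_KS-mix.gguf IQ4_KS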
Cheers!