
Conversation

@b4rtaz (Owner) commented on Aug 17, 2025

This PR fixes a precision issue in the Vulkan matrix multiplication (Q80 × Q40 → F32) that affected Qwen models on NVIDIA GPUs.
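
The change itself isn't shown in this description, so below is only a minimal C++ sketch of the kind of fix the branch name (fix/vulkan-matmul-q80-q40-f32) hints at: accumulating the Q80 × Q40 dot product in full fp32 instead of half precision. The block layouts, nibble packing, and helper names (BlockQ40, BlockQ80, fp16ToFp32, dotQ40Q80) are illustrative assumptions, not the actual Vulkan shader code.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Assumed block layouts: 32-element quantization blocks with an fp16 scale.
struct BlockQ40 { uint16_t d; uint8_t qs[16]; };  // fp16 scale + 32 packed 4-bit weights
struct BlockQ80 { uint16_t d; int8_t  qs[32]; };  // fp16 scale + 32 int8 values

// Minimal fp16 -> fp32 conversion (denormals flushed to zero, no inf/NaN handling).
float fp16ToFp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = (exp == 0) ? sign : (sign | ((exp + 112u) << 23) | (mant << 13));
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Scalar reference of one Q40 x Q80 dot product. The assumed point of the fix:
// keep the running sum in fp32; accumulating the per-block products in fp16
// can lose enough accuracy to degrade model outputs.
float dotQ40Q80(const BlockQ40 *w, const BlockQ80 *x, size_t nBlocks) {
    float sum = 0.0f;                              // full-precision accumulator
    for (size_t b = 0; b < nBlocks; b++) {
        int32_t s = 0;                             // exact integer partial sum per block
        for (int i = 0; i < 16; i++) {
            int lo = (w[b].qs[i] & 0x0F) - 8;      // low nibble  -> [-8, 7]
            int hi = (w[b].qs[i] >> 4)   - 8;      // high nibble -> [-8, 7]
            s += lo * x[b].qs[i] + hi * x[b].qs[i + 16];
        }
        sum += fp16ToFp32(w[b].d) * fp16ToFp32(x[b].d) * (float)s;
    }
    return sum;
}
```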

Performance tests

Tested on an NVIDIA GeForce RTX 3060 (12 GB).

🌋 Device: NVIDIA GeForce RTX 3060                                                                                                                
🌋 DeviceApiVersion: 1.4.303                                                                                                                      
🌋 MaxComputeSharedMemory: 48 kB                                                                                                                  
🌋 NonCoherentAtomSize: 64 bytes                                                                                                                  
🌋 Heap[0]: 12288 MB                                                                                                                              
🌋 Heap[2]: 246 MB

Prediction (--steps 128)

| Model | Tokens/s (0.15.0) | Tokens/s (0.15.2) | Tokens/s (this PR) |
|---|---|---|---|
| qwen3_8b_q40 | 12.9 | 13.65 | 16.86 |

That is roughly a 24% improvement over 0.15.2 and roughly 31% over 0.15.0 for this model.

@b4rtaz merged commit 8909825 into main on Aug 17, 2025
3 checks passed
@b4rtaz (Owner, Author) commented on Aug 17, 2025

Another test on a similar setup. Distributed Llama is about 2.7x slower than llama.cpp here (19.37 tokens/s vs. 53.10 tokens/s for token generation).


llama.cpp

build/bin/llama-cli -m ../meta-llama-3.1-8b-instruct-q4_0.gguf\?download\=true -ngl 100 -c 16384 -t 10 -n -2 -cnv --prompt "The highest mountain on earth"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3060 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
build: 1 (4d19698) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce RTX 3060) - 12288 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 292 tensors from ../meta-llama-3.1-8b-instruct-q4_0.gguf?download=true (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:    2 tensors
llama_model_loader: - type q4_0:  224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 5.61 GiB (6.01 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta Llama 3.1 8B Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:      Vulkan0 model buffer size =  4747.02 MiB
load_tensors:   CPU_Mapped model buffer size =  1002.00 MiB
...................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:    Vulkan0 KV buffer size =  2048.00 MiB
llama_kv_cache_unified: size = 2048.00 MiB ( 16384 cells,  32 layers,  1/1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context:    Vulkan0 compute buffer size =  1092.01 MiB
llama_context: Vulkan_Host compute buffer size =    44.01 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 10
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 10 (n_threads_batch = 10) / 48 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 405221323
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 16384
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 16384, n_batch = 2048, n_predict = -2, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user

The highest mountain on earthassistant

You're thinking of Mount Everest!

Yes, Mount Everest, also known as Chomolungma or Sagarmatha, is the highest mountain on Earth. It's located in the Himalayas on the border between Nepal and Tibet, China.

Here are some fascinating facts about Mount Everest:

1. **Height**: Mount Everest stands at an incredible 8,848 meters (29,029 feet) above sea level.
2. **Formation**: The mountain was formed around 60 million years ago when the Indian tectonic plate collided with the Eurasian plate, pushing the Earth's crust upwards to create the Himalayan mountain range.
3. **Climbing**: The first successful ascent of Mount Everest was made by Sir Edmund Hillary from New Zealand and Sherpa mountaineer Tenzing Norgay in 1953. Since then, many others have followed in their footsteps.
4. **Challenges**: Climbing Mount Everest is an extreme undertaking, requiring a tremendous amount of physical and mental endurance. The extreme altitude, harsh weather conditions, and steep terrain make it a significant challenge.
5. **Record holders**: As of 2022, the record holder for the fastest ascent of Mount Everest is Nirmal Purja, a Nepali mountaineer who completed the climb in 6 hours and 45 minutes.
6. **Environmental concerns**: The increasing popularity of Mount Everest has led to concerns about the impact on the environment, including litter, human waste, and the degradation of the mountain's fragile ecosystem.

The highest mountain on Earth is an awe-inspiring natural wonder that continues to captivate people around the world. Its sheer scale, beauty, and challenges have made it an iconic symbol of human adventure and exploration.

> 
llama_perf_sampler_print:    sampling time =      61.42 ms /   365 runs   (    0.17 ms per token,  5942.40 tokens per second)
llama_perf_context_print:        load time =    3184.08 ms
llama_perf_context_print: prompt eval time =      70.33 ms /    15 tokens (    4.69 ms per token,   213.28 tokens per second)
llama_perf_context_print:        eval time =    6572.91 ms /   349 runs   (   18.83 ms per token,    53.10 tokens per second)
llama_perf_context_print:       total time =    8836.89 ms /   364 tokens
llama_perf_context_print:    graphs reused =        337

distributed-llama

(main) root@C.25043970:/workspace/distributed-llama$ ./dllama inference --prompt "The highest mountain on earth" --steps 256 --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --nthreads 1 --gpu-index 0 --max-seq-len 4096
📄 AddBos: 79
📄 BosId: 128000 (<|begin_of_text|>)
📄 EosId: 128001 (<|end_of_text|>) 128009 (<|eot_id|>) 
📄 RegularVocabSize: 128000
📄 SpecialVocabSize: 256
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 4096
💡 HeadDim: 128
💡 QDim: 4096
💡 KvDim: 1024
💡 HiddenDim: 14336
💡 VocabSize: 128256
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 4096
💡 NormEpsilon: 0.000010
💡 RopeType: Llama3.1
💡 RopeTheta: 500000
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
📀 RequiredMemory: 7345422 kB
🌋 Device: NVIDIA GeForce RTX 3060
🌋 DeviceApiVersion: 1.4.303
🌋 MaxComputeSharedMemory: 48 kB
🌋 NonCoherentAtomSize: 64 bytes
🌋 Heap[0]: 12288 MB
🌋 Heap[2]: 246 MB
💿 Loading weights...
💿 Weights loaded
The highest mountain on earth
🔷️ Eval  408 ms Sync    0 ms | Sent     0 kB Recv     0 kB | (5 tokens)
🔶 Pred   72 ms Sync    0 ms | Sent     0 kB Recv     0 kB | ,
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  Mount
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  Everest
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB | ,
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  is
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  known
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  for
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  its
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  extreme
🔶 Pred   49 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  conditions
🔶 Pred   48 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  and
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  challenging
...
🔶 Pred   50 ms Sync    0 ms | Sent     0 kB Recv     0 kB | Ste
🔶 Pred   52 ms Sync    0 ms | Sent     0 kB Recv     0 kB | ep
🔶 Pred   52 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  terrain
🔶 Pred   52 ms Sync    0 ms | Sent     0 kB Recv     0 kB | :
🔶 Pred   52 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  The
🔶 Pred   53 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  terrain
🔶 Pred   52 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  on
🔶 Pred   52 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  Mount
🔶 Pred   52 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  Everest

Evaluation
   nBatches: 32
    nTokens: 5
   tokens/s: 12.23 (81.78 ms/tok)
Prediction
    nTokens: 251
   tokens/s: 19.37 (51.62 ms/tok)

@b4rtaz deleted the fix/vulkan-matmul-q80-q40-f32 branch on Aug 18, 2025 at 11:53