how to test Q4 models with the backend: "AMXInt8" or "AMXBF16" #1371
voipmonitor asked this question in Q&A (Unanswered)
Hello,
I recently tested the ktransformers AMX support, and the speed-up for prefill is nice.
The documentation includes a figure showing test results for the Qwen3-30B-A3B (4-bit) model, but it also states: "Qwen3MoE running with AMX can only read BF16 GGUF". So, as expected, loading any other GGUF such as Q4_K_M does not work (or I'm doing something wrong).
How can I try the 4-bit version with the AMX optimizations? Am I missing something?
This is how I run it (this works):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/Qwen/Qwen3-30B-A3B --gguf_path /root/models/unsloth/Qwen3-30B-A3B-GGUF/BF16 --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --cache_lens 32768 --chunk_size 512 --max_batch_size 8 --model_name "unsloth/Qwen3-30B-A3B"
and this one ends with the error assert self.gate_type == GGMLQuantizationType.BF16 (so I guess it needs the BF16 format, but then how do I load a 4-bit quantized model?):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/unsloth/Qwen3-30B-A3B/ --gguf_path /mnt/models/Qwen/Qwen3-30B-A3B-Q4_K_M.gguf --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --model_name "unsloth/Qwen3-30B-A3B"
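To double-check where the assertion comes from, here is a small snippet I would use (a minimal sketch, assuming the gguf Python package from llama.cpp is installed; the path is the Q4_K_M file from above) to print the quantization type of each tensor, which should show the gate/expert tensors as Q4_K rather than the BF16 the AMX path asserts on:

# Sketch: list the quantization type of every tensor in the GGUF file
# (assumes `pip install gguf`, the reader package shipped with llama.cpp)
from gguf import GGUFReader

reader = GGUFReader("/mnt/models/Qwen/Qwen3-30B-A3B-Q4_K_M.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum member, e.g. Q4_K or BF16
    print(tensor.name, tensor.tensor_type.name)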