how to test Q4 models with the backend: "AMXInt8" or "AMXBF16" #1371
voipmonitor asked this question in Q&A (Unanswered)
Hello,
I recently tested the ktransformers AMX support, and the speed-up for prefill is nice.
The documentation includes a figure showing test results for the Qwen3-30B-A3B (4-bit) model, but it also states: "Qwen3MoE running with AMX can only read BF16 GGUF". So, as expected, loading any other GGUF such as Q4_K_M does not work (or I'm doing something wrong).
How can I try the 4-bit version with the AMX optimizations? Am I missing something?
This is how I run it (this works):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/Qwen/Qwen3-30B-A3B --gguf_path /root/models/unsloth/Qwen3-30B-A3B-GGUF/BF16 --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --cache_lens 32768 --chunk_size 512 --max_batch_size 8 --model_name "unsloth/Qwen3-30B-A3B"
and this one ends with the error assert self.gate_type == GGMLQuantizationType.BF16 (so I guess it needs the BF16 format, but then how do I load a 4-bit quantized model?):
python -m ktransformers.server.main --architectures Qwen3MoeForCausalLM --model_path /root/models/unsloth/Qwen3-30B-A3B/ --gguf_path /mnt/models/Qwen/Qwen3-30B-A3B-Q4_K_M.gguf --optimize_config_path /root/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve-amx.yaml --backend_type balance_serve --model_name "unsloth/Qwen3-30B-A3B"
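To double-check where the assertion comes from, here is a small snippet I would use (a minimal sketch, assuming the gguf Python package from llama.cpp is installed; the path is the Q4_K_M file from above) to print the quantization type of each tensor, which should show the gate/expert tensors as Q4_K rather than the BF16 the AMX path asserts on:

# Sketch: list the quantization type of every tensor in the GGUF file
# (assumes `pip install gguf`, the reader package shipped with llama.cpp)
from gguf import GGUFReader

reader = GGUFReader("/mnt/models/Qwen/Qwen3-30B-A3B-Q4_K_M.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum member, e.g. Q4_K or BF16
    print(tensor.name, tensor.tensor_type.name)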