Replies: 1 comment 1 reply
If you don't specify a BLAS backend, it defaults to llamafile, I think, which is faster on CPU, but that's not relevant unless you're using `-nkvo`?
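A minimal sketch of what that split looks like, assuming a CMake build of mainline llama.cpp (flag and option names as in recent versions; `model.gguf` is a placeholder):

```sh
# Default build: no external BLAS; the CPU backend uses the bundled
# llamafile matmul kernels (GGML_LLAMAFILE is ON by default).
cmake -B build
cmake --build build --config Release

# Build against an explicit BLAS backend instead (e.g. OpenBLAS).
cmake -B build-blas -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build-blas --config Release

# -nkvo / --no-kv-offload keeps the KV cache on the CPU, which is when
# the CPU matmul path matters for attention even with layers on the GPU.
./build/bin/llama-cli -m model.gguf -ngl 99 -nkvo -p "hi" -n 32
```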
Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use, anyways) can run.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly mainline LCPP gets pretty hosed.
Specs:
Here are the cherry-picked results that show each framework at its best -- both are running with `-ot exps=CPU` (with the LCPP table slightly modified because they output different formats). And here's the full log, including the commands used and other random attempts.
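For anyone who wants to reproduce the shape of these runs, something along these lines exercises the same `-ot exps=CPU` split on both builds. This is an illustrative sketch, not the exact commands from the log: the model path, thread count, and batch sizes are placeholders, and it assumes builds recent enough that `llama-bench` accepts `-ot`.

```sh
# Mainline llama.cpp: all layers nominally on GPU (-ngl 99), but the MoE
# expert tensors are overridden back to the CPU with -ot.
./llama.cpp/build/bin/llama-bench -m model.gguf -ngl 99 -fa 1 \
    -ot "exps=CPU" -p 512 -n 128 -t 16

# ik_llama.cpp with the same override, for comparison.
./ik_llama.cpp/build/bin/llama-bench -m model.gguf -ngl 99 -fa 1 \
    -ot "exps=CPU" -p 512 -n 128 -t 16
```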
Some other interesting notes:

- `amb` is higher, but it's faster for `amb` to be lower with FA. ???
- I tried both `exps=CPU` (which I later found only offloads parts of the FFN to the CPU) and `ffn=CPU` (which offloads all of the FFN to the CPU, as I was originally intending)... but it's slower to use the one which offloads the norms and stuff too! For some reason! (See the sketch at the end of this post.)

I still need to try dense models, CPU without offload, etc. etc. for this to be a fair comparison, but I hope this is still interesting data :)
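As promised above, here's a sketch of the `exps=CPU` vs `ffn=CPU` difference. `-ot` takes a regex that's matched against tensor names, so which tensors move depends entirely on the naming; the tensor names in the comments are typical for MoE GGUFs but vary by model, and the commands are placeholder invocations rather than my exact ones.

```sh
# "exps" only matches the expert weights, e.g. blk.N.ffn_gate_exps.weight,
# blk.N.ffn_up_exps.weight, blk.N.ffn_down_exps.weight.
./build/bin/llama-cli -m model.gguf -ngl 99 -ot "exps=CPU" -p "hi" -n 32

# "ffn" also matches the rest of the FFN, including the tiny ffn_norm
# weights -- this is the variant that turned out to be slower for me.
./build/bin/llama-cli -m model.gguf -ngl 99 -ot "ffn=CPU" -p "hi" -n 32
```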