Massive slowdown on Linux #8582

MrJackSpade · 2024-07-19T04:33:27Z

I'm using the same commit of llama.cpp on a Linux machine and a Windows machine.

Both machines have DDR4 memory. I've tested the memory speeds: the Windows machine measures ~35,000 MiB/s and the Linux machine ~30,000 MiB/s.

For some reason, though, the Windows machine is running ~4x faster than the Linux machine using the same(ish) settings. The Linux machine used to run Windows, and as far as I remember it ran at about the same speed as the current Windows machine, which makes sense because the memory speeds are about the same.

Both have CUDA support compiled in for cuBLAS; however, both are running with 0 layers offloaded to the GPU.

The model I'm using to test is L3-8B-Celeste-v1.Q8_0.gguf

This is the Linux machine

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 8192, n_batch = 2048, n_predict = 16, n_keep = 1


Volkswagen ID.4
Volkswagen ID.4 - a compact
llama_print_timings:        load time =    3932.60 ms
llama_print_timings:      sample time =       7.01 ms /    16 runs   (    0.44 ms per token,  2281.15 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   13804.85 ms /    16 runs   (  862.80 ms per token,     1.16 tokens per second)
llama_print_timings:       total time =   13835.46 ms /    16 tokens
Log end

This is the Windows machine

system_info: n_threads = 4 / 24 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = 16, n_keep = 1


I am a writer, editor and content strategist with over 10 years of experience
llama_print_timings:        load time =    2137.08 ms
llama_print_timings:      sample time =       1.30 ms /    16 runs   (    0.08 ms per token, 12336.16 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (-nan(ind) ms per token, -nan(ind) tokens per second)
llama_print_timings:        eval time =    3442.71 ms /    16 runs   (  215.17 ms per token,     4.65 tokens per second)
llama_print_timings:       total time =    3448.57 ms /    16 tokens
Log end
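
To put a number on the gap, the two eval-time lines above can be compared directly. This is a minimal sketch: the helper `ms_per_token` and the regular expression are illustrative names written for this example, not part of llama.cpp, and the two input strings are copied from the logs above.

```python
import re

# Eval-time lines copied verbatim from the Linux and Windows logs above.
LINUX = "llama_print_timings:        eval time =   13804.85 ms /    16 runs   (  862.80 ms per token,     1.16 tokens per second)"
WINDOWS = "llama_print_timings:        eval time =    3442.71 ms /    16 runs   (  215.17 ms per token,     4.65 tokens per second)"

# Captures the total milliseconds and the number of runs from an eval-time line.
EVAL_RE = re.compile(r"eval time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*runs")

def ms_per_token(line: str) -> float:
    """Hypothetical helper: recompute milliseconds per token from one timing line."""
    match = EVAL_RE.search(line)
    if match is None:
        raise ValueError("not an eval-time line")
    total_ms, runs = float(match.group(1)), int(match.group(2))
    return total_ms / runs

linux_ms = ms_per_token(LINUX)
windows_ms = ms_per_token(WINDOWS)
slowdown = linux_ms / windows_ms
print(f"Linux {linux_ms:.2f} ms/token, Windows {windows_ms:.2f} ms/token, ratio {slowdown:.2f}x")
# prints: Linux 862.80 ms/token, Windows 215.17 ms/token, ratio 4.01x
```

Note that the two runs are not configured identically: the Linux log reports n_threads = 8 and the Windows log n_threads = 4, so the ratio compares the runs as posted.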

The machine specs themselves are pretty different: the Windows machine has a 3090 and a 5900X, while the Linux machine is a laptop with a 3080m and an i7-11800H. But since it's pure CPU inference, my understanding is that RAM speed is what really determines the inference speed, right?
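
The RAM-speed reasoning can be checked with a back-of-envelope estimate. The sketch below assumes that generating one token reads every weight once, that the reported memory figures are in MiB/s, and that the model has about 8.03e9 parameters; none of these is stated in the thread. Q8_0 packs 32 int8 weights with one 16-bit scale, which gives 34 bytes per 32 weights. Treat the result as a rough estimate, not a hard limit.

```python
MIB = 1024 * 1024

# Assumption: ~8.03e9 parameters for a Llama 3 8B model (not stated in the thread).
N_PARAMS = 8.03e9
# Q8_0 stores blocks of 32 int8 weights plus one 16-bit scale: 34 bytes per 32 weights.
BYTES_PER_WEIGHT = 34 / 32

def estimated_tokens_per_second(bandwidth_mib_s: float) -> float:
    """Estimate for a purely bandwidth-bound decoder that reads every weight once per token."""
    model_bytes = N_PARAMS * BYTES_PER_WEIGHT
    return bandwidth_mib_s * MIB / model_bytes

# (reported bandwidth, assumed to be MiB/s; measured tokens per second from the logs)
machines = {"Linux": (30_000, 1.16), "Windows": (35_000, 4.65)}

for name, (bandwidth, measured) in machines.items():
    estimate = estimated_tokens_per_second(bandwidth)
    print(f"{name}: estimate {estimate:.1f} tok/s, measured {measured:.2f} tok/s "
          f"({measured / estimate:.0%} of estimate)")
```

Under these assumptions the estimate comes out at roughly 3.7 tokens per second for the Linux machine and 4.3 for the Windows machine. The Windows measurement sits close to its estimate, while the Linux measurement is several times lower, which suggests the Linux run is limited by something other than memory bandwidth.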

I swear to god I remember getting approx the same speed on pure CPU before moving one of the machines over to Linux. Maybe not the exact same, but not a 4x difference!

dspasyuk · 2024-07-20T15:57:25Z

@MrJackSpade Hm, interesting. I don't see this with the current version of llama.cpp on a Ryzen 3700X. What Linux are you using? Is it possible that Linux is missing some drivers for your CPU?

1 reply

MrJackSpade Jul 21, 2024
Author

So I ended up figuring out what it was after a ton of work, and I'm assuming it's my fault.

Basically what was happening was that the Windows build was being built in "release", but the Linux build was being built in debug.

What's confusing about that is that I was using the same configurations for both the Windows and Linux builds. I had created a new configuration that inherited from "release", and when I selected it as the build configuration on Windows it worked; however, when using the same build configuration under Linux, it compiled in debug for some reason.

I switched to just calling the base "Release" itself and passing in all the required parameters on the CLI instead of trying to create a preset for it.

I'm assuming that's just my own misunderstanding of how building with CMake on Linux works because, as I said, it worked perfectly fine on Windows, honoring the build type and all the flags.
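
One possible explanation, which the thread does not confirm: multi-config CMake generators such as Visual Studio pick the configuration at build time, while single-config generators such as Makefiles or Ninja fix it at configure time through `CMAKE_BUILD_TYPE`, which is recorded in `CMakeCache.txt`. The sketch below reads that entry so a build tree can be checked before benchmarking. The function name, the sample cache text, and the `build` directory are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

def build_type_from_cache(cache_text: str) -> Optional[str]:
    """Return the CMAKE_BUILD_TYPE value found in CMakeCache.txt text, or None if absent."""
    for line in cache_text.splitlines():
        # Cache entries have the form NAME:TYPE=VALUE; comment lines start with '#' or '//'.
        if line.startswith("CMAKE_BUILD_TYPE:"):
            _, _, value = line.partition("=")
            return value.strip()
    return None

# Illustrative cache fragment, not taken from the thread.
sample = """\
# This is the CMakeCache file.
//Choose the type of build.
CMAKE_BUILD_TYPE:STRING=Debug
CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/c++
"""
print(build_type_from_cache(sample))  # prints: Debug

# To check a real build tree (the "build" directory name is an assumption):
cache_path = Path("build") / "CMakeCache.txt"
if cache_path.exists():
    print(build_type_from_cache(cache_path.read_text()))
```

The check is only meaningful for single-config generators; with a multi-config generator the configuration is chosen when building, for example with `cmake --build <dir> --config Release`.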
