Prompt processing consistently faster on 8-bit than 4-bit (with/without OpenBLAS), how? #1619
In my benchmarks, OpenBLAS is slightly slower than Accelerate (on macOS). I also found that the prompt eval time (especially for large prompts) is highly sensitive to CPU temperature and system load.
The BLAS implementation in current master has a few shortcomings; you may want to have a look at my PR #1632 if you want to figure out some of the details.
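It is also easy to end up with a different backend than you expect. One quick sanity check (assuming a default Makefile build on macOS with the stock ./main binary) for which BLAS backend the binary is actually linked against:

# List linked libraries; Accelerate shows up as a framework,
# OpenBLAS as a dylib (unless it was linked statically, in which
# case nothing BLAS-related will appear here).
otool -L ./main | grep -iE 'accelerate|blas'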
Hi experts!
When I run quantized versions of the Chinese Alpaca Plus model on llama.cpp with the --ins flag, prompt processing is faster on an 8-bit model than on its 4-bit counterpart; the command below is the 8-bit run. (Notice the prompt eval time. I made sure it's not a coincidence by running the two models multiple times interleaved, 4-8-4-8-8-4-8-4, with the same prompt, and the 8-bit model is consistently faster: ~70 ms vs. 95+ ms per token.)
./main -m ../ggml-model-q8_0.bin --color -ins -n 2048 -c 512 -t 16 -b 512 --mlock --temp 0.2 --repeat_penalty 1.3
For comparison, below is the 4-bit run. Generation performance is indeed better than 8-bit, as expected, but prompt-processing performance is not. The test was repeated several times with all flags and settings kept identical, and running with or without --mlock makes no difference. My current guess is that 4-bit operations are not as well optimized as 8-bit ones on modern CPUs?
./main -m ../ggml-model-q4_0.bin --color -ins -n 2048 -c 512 -t 16 -b 512 --mlock --temp 0.2 --repeat_penalty 1.3
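In case it helps anyone reproduce this, the interleaved runs can be scripted along these lines (a rough sketch: I'm using -p with a fixed prompt instead of -ins so there's no interactive input, and grepping the timings llama.cpp prints on stderr; the exact wording of the timing line may differ between versions):

# Alternate between the two models with an identical prompt and
# extract the prompt-eval timing line from llama.cpp's stderr output.
PROMPT="Please summarize the history of the Great Wall of China."
for m in q8_0 q4_0 q8_0 q4_0; do
    ./main -m ../ggml-model-$m.bin -p "$PROMPT" -n 16 -c 512 -t 16 -b 512 \
        --temp 0.2 --repeat_penalty 1.3 2>&1 | grep 'prompt eval time'
done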
Thanks in advance for any clarification of this behavior.
More info:
Another weird behavior: when I run these two models on a llama.cpp build without BLAS enabled, both of them show better prompt-processing performance (8-bit drops from 70+ ms to 50+ ms per token, 4-bit from 90+ ms to 70+ ms), suggesting BLAS isn't helping at all here. I would really appreciate any help understanding this; I'm new to these accelerator technologies and can't figure it out on my own.
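For reference, this is roughly how I toggle the backends between builds (the LLAMA_OPENBLAS and LLAMA_NO_ACCELERATE Makefile options are from the version I'm on and may differ elsewhere):

# Default build: on macOS this picks up Accelerate for BLAS.
make clean && make
# Build against OpenBLAS instead of Accelerate.
make clean && make LLAMA_OPENBLAS=1
# Build with Accelerate disabled, i.e. no BLAS at all on macOS.
make clean && make LLAMA_NO_ACCELERATE=1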
Btw, is there a way to turn on detailed per-operation performance data during the prompt-processing stage?