Prompt processing consistently faster on 8-bit than 4-bit (with/without OpenBLAS), how? #1619
In my benchmarks, OpenBLAS is slightly slower than Accelerate (on macOS). I also found that the prompt eval time (especially for large prompts) is highly sensitive to CPU temperature and system load.
The BLAS implementation in current master has a few shortcomings; you may want to have a look at my PR #1632 if you want to figure out some of the details.
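It is also easy to end up with a different backend than you expect. One quick sanity check (assuming a default Makefile build on macOS with the stock ./main binary) for which BLAS backend the binary is actually linked against:

# List linked libraries; Accelerate shows up as a framework,
# OpenBLAS as a dylib (unless it was linked statically, in which
# case nothing BLAS-related will appear here).
otool -L ./main | grep -iE 'accelerate|blas'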
Hi experts!
When I run quantized versions of the Chinese Alpaca Plus model on llama.cpp with the --ins flag, prompt processing is faster on an 8-bit model than on its 4-bit counterpart; the command below is the 8-bit run. (Notice the prompt eval time. I made sure it's not a coincidence by running the two models multiple times interleaved, 4-8-4-8-8-4-8-4, with the same prompt, and the 8-bit model is consistently faster: ~70 ms vs. 95+ ms per token.)
./main -m ../ggml-model-q8_0.bin --color -ins -n 2048 -c 512 -t 16 -b 512 --mlock --temp 0.2 --repeat_penalty 1.3
For comparison, below is the 4-bit run. Generation performance is indeed better than 8-bit, as expected, but prompt-processing performance is not. The test was repeated several times with all flags and settings kept identical, and running with or without --mlock makes no difference. My current guess is that 4-bit operations are not as well optimized as 8-bit ones on modern CPUs?
./main -m ../ggml-model-q4_0.bin --color -ins -n 2048 -c 512 -t 16 -b 512 --mlock --temp 0.2 --repeat_penalty 1.3
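In case it helps anyone reproduce this, the interleaved runs can be scripted along these lines (a rough sketch: I'm using -p with a fixed prompt instead of -ins so there's no interactive input, and grepping the timings llama.cpp prints on stderr; the exact wording of the timing line may differ between versions):

# Alternate between the two models with an identical prompt and
# extract the prompt-eval timing line from llama.cpp's stderr output.
PROMPT="Please summarize the history of the Great Wall of China."
for m in q8_0 q4_0 q8_0 q4_0; do
    ./main -m ../ggml-model-$m.bin -p "$PROMPT" -n 16 -c 512 -t 16 -b 512 \
        --temp 0.2 --repeat_penalty 1.3 2>&1 | grep 'prompt eval time'
done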
Thanks in advance for any clarification of this behavior.
More info:
Another weird behavior: when I run these two models on a llama.cpp build without BLAS enabled, both of them show better prompt-processing performance (8-bit drops from 70+ ms to 50+ ms per token, 4-bit from 90+ ms to 70+ ms), suggesting BLAS isn't helping at all here. I would really appreciate any help understanding this; I'm new to these accelerator technologies and can't figure it out on my own.
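For reference, this is roughly how I toggle the backends between builds (the LLAMA_OPENBLAS and LLAMA_NO_ACCELERATE Makefile options are from the version I'm on and may differ elsewhere):

# Default build: on macOS this picks up Accelerate for BLAS.
make clean && make
# Build against OpenBLAS instead of Accelerate.
make clean && make LLAMA_OPENBLAS=1
# Build with Accelerate disabled, i.e. no BLAS at all on macOS.
make clean && make LLAMA_NO_ACCELERATE=1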
Btw, is there a way to turn on detailed per-operation performance data during the prompt-processing stage?