Why does CLBlast make llama.cpp slower on my system? #1950
-
That's almost certainly the reason. Integrated GPUs tend to be pretty weak, and they usually share memory with the system. LLM inference has heavy memory bandwidth demands, so having the weak GPU compete with the part running on the CPU can add more overhead than benefit. This is also why it's usually not worth using more threads than you have physical cores (i.e. counting SMT/hyperthreading threads). From what I know, OpenCL (at least with llama.cpp) tends to be slower than CUDA even when you can use CUDA (which of course you can't here). You basically need a reasonably powerful discrete GPU to benefit from GPU offloading for LLM inference. However, you might still see a benefit from compiling with CLBlast but not offloading any layers, because BLAS can speed up prompt processing. This is something you'll only really notice with relatively large prompts, so you'll have to do some testing to determine whether it's actually a performance advantage in your case. I have an Nvidia GTX 1060 (which is a pretty old card), and even with a decent GPU it's still worthwhile to use the GPU for prompt ingestion; offloading actual layers to the GPU is a performance loss for me, though.
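Roughly something like this (a sketch only: the model path and prompt file are placeholders, and the `-t` value should match your own physical core count):

```sh
# Build with CLBlast so prompt processing can go through BLAS.
make clean && make LLAMA_CLBLAST=1

# Keep all layers on the CPU (-ngl 0) and compare prompt eval timings
# against a plain CPU-only build using the same, reasonably long prompt.
./main -m model/path -t 8 -ngl 0 -f long-prompt.txt
```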
-
The 1650 should be slower than a 1060. Offloading may be kind of borderline, but assuming the GPU is actually fast enough for it to be a performance benefit, even offloading a couple of layers should have a positive effect. When you were testing with those, you compiled with CUDA, not OpenCL, correct?
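If not, this is roughly what I'd compare against (a sketch: `LLAMA_CUBLAS=1` was the cuBLAS build flag at the time of this thread, the model path is a placeholder, and the `-ngl` value is just a starting point to raise while generation stays faster):

```sh
# Build against cuBLAS instead of CLBlast (requires the CUDA toolkit).
make clean && make LLAMA_CUBLAS=1

# Start with a small number of offloaded layers and increase gradually.
./main -m model/path -ngl 8
```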
-
Hi, can you please try the latest CLBlast 1.6.1, where I have uploaded the latest tuning results for the 5700G APU? More info:
-
I'm also using an AMD integrated GPU (the 680M in a 7735HS). I noticed that FP16 compute utilisation was quite low and that most of the time was spent copying data between dedicated GPU memory and CPU memory. If you change the buffer allocation flag from the original CL_MEM_READ_WRITE to CL_MEM_ALLOC_HOST_PTR, Task Manager shows less copying and higher FP16 compute utilisation. Additionally, prompt processing seems much faster, but it is still slower than full CPU inference. However, proper use of zero-copy buffers according to the Intel documentation indicates that
Sources:
Build:
Results:
- CPU backend
- CL_MEM_READ_WRITE
- CL_MEM_READ_WRITE
- CL_MEM_ALLOC_HOST_PTR
- CL_MEM_ALLOC_HOST_PTR
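To illustrate the difference (a minimal sketch only, not the actual ggml-opencl code; the `upload_zero_copy` helper and the surrounding context/queue setup are made up for the example): a buffer created with CL_MEM_ALLOC_HOST_PTR can be mapped and filled in place, which on an iGPU sharing memory with the CPU lets the driver avoid a separate device copy, whereas a plain CL_MEM_READ_WRITE buffer is normally written with an explicit transfer.

```c
// Sketch of the flag change discussed above, not the actual ggml-opencl code.
// On an iGPU sharing memory with the CPU, CL_MEM_ALLOC_HOST_PTR lets the driver
// place the buffer in host-visible memory so it can be mapped instead of copied.
#include <CL/cl.h>
#include <string.h>

/* `ctx` and `queue` are assumed to already target the integrated GPU. */
static cl_int upload_zero_copy(cl_context ctx, cl_command_queue queue,
                               const float *src, size_t n, cl_mem *out) {
    cl_int err;

    /* Host-allocatable buffer: a zero-copy candidate on an integrated GPU. */
    *out = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          n * sizeof(float), NULL, &err);
    if (err != CL_SUCCESS) return err;

    /* Map and fill in place instead of clEnqueueWriteBuffer (an explicit copy). */
    void *ptr = clEnqueueMapBuffer(queue, *out, CL_TRUE, CL_MAP_WRITE, 0,
                                   n * sizeof(float), 0, NULL, NULL, &err);
    if (err != CL_SUCCESS) return err;
    memcpy(ptr, src, n * sizeof(float));
    return clEnqueueUnmapMemObject(queue, *out, ptr, 0, NULL, NULL);
}
```

On a discrete GPU the same flag can instead leave the buffer in slow-to-access host memory, which is presumably why it isn't a universally safe default.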
Beta Was this translation helpful? Give feedback.
-
Hello,
llama.cpp compiled with CLBlast gives very poor performance on my system when I store layers in VRAM.
Any idea why? How many layers am I supposed to store in VRAM?

My config:
- llama.cpp compiled with `make LLAMA_CLBLAST=1`.
- Using `amdgpu-install --opencl=rocr`, I've managed to install AMD's proprietary OpenCL on this laptop.

When I run `./main -m model/path`, text generation is relatively fast. When I run `./main -m model/path -ngl 35`, text generation is very slow. I tried various values for the `-ngl` argument, but it is always very slow. Installing or removing the `mesa-opencl-icd` package did not improve the performance.

Is CLBlast really supposed to make llama.cpp faster? What could explain it being slower on my system? Is it because of the integrated GPU? What is your experience with CLBlast?
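One thing worth checking before drawing conclusions (a sketch: `clinfo` needs to be installed, and the `GGML_OPENCL_PLATFORM`/`GGML_OPENCL_DEVICE` environment variables only help if your llama.cpp build reads them, with matching behaviour that may differ between versions) is which OpenCL platform and device the CLBlast build actually ends up on, since ROCm and Mesa can both expose the iGPU:

```sh
# List the platforms/devices the OpenCL ICD loader can see (ROCm vs. Mesa).
clinfo -l

# Pin the backend to the intended platform/device before timing -ngl runs.
GGML_OPENCL_PLATFORM=AMD GGML_OPENCL_DEVICE=0 ./main -m model/path -ngl 35
```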