Why does CLBlast make llama.cpp slower on my system? #1950
-
That's almost certainly the reason. Integrated GPUs tend to be pretty weak, and they usually share memory with the system. LLM inference has heavy memory bandwidth demands, so having the weak GPU compete with the part running on the CPU can add more overhead than benefit. This is also why it's usually not worth using more threads than you have physical cores (i.e. counting SMT/hyperthreading threads). From what I know, OpenCL (at least with llama.cpp) tends to be slower than CUDA even when you can use CUDA (which of course you can't here). You basically need a reasonably powerful discrete GPU to benefit from GPU offloading for LLM inference. However, you might still see a benefit from compiling with CLBlast but not offloading any layers, because BLAS can speed up prompt processing. This is something you'll only really notice with relatively large prompts, so you'll have to do some testing to determine whether it's actually a performance advantage in your case. I have an Nvidia GTX 1060 (which is a pretty old card), and even with a decent GPU it's still worthwhile to use the GPU for prompt ingestion; offloading actual layers to the GPU is a performance loss for me, though.
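Roughly something like this (a sketch only: the model path and prompt file are placeholders, and the `-t` value should match your own physical core count):

```sh
# Build with CLBlast so prompt processing can go through BLAS.
make clean && make LLAMA_CLBLAST=1

# Keep all layers on the CPU (-ngl 0) and compare prompt eval timings
# against a plain CPU-only build using the same, reasonably long prompt.
./main -m model/path -t 8 -ngl 0 -f long-prompt.txt
```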
-
The 1650 should be slower than a 1060. Offloading may be kind of borderline, but assuming the GPU is actually fast enough for it to be a performance benefit, even offloading a couple of layers should have a positive effect. When you were testing with those, you compiled with CUDA, not OpenCL, correct?
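If not, this is roughly what I'd compare against (a sketch: `LLAMA_CUBLAS=1` was the cuBLAS build flag at the time of this thread, the model path is a placeholder, and the `-ngl` value is just a starting point to raise while generation stays faster):

```sh
# Build against cuBLAS instead of CLBlast (requires the CUDA toolkit).
make clean && make LLAMA_CUBLAS=1

# Start with a small number of offloaded layers and increase gradually.
./main -m model/path -ngl 8
```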
-
Hi, can you please try the latest CLBlast 1.6.1, where I have uploaded the latest tuning results for the 5700G APU? More info:
-
I'm also using an AMD integrated GPU (the 680M in a 7735HS). I noticed that FP16 compute utilisation was quite low and that most of the time was spent copying data between dedicated GPU memory and CPU memory. If you change the buffer allocation flag from the original CL_MEM_READ_WRITE to CL_MEM_ALLOC_HOST_PTR, Task Manager shows less copying and higher FP16 compute utilisation. Additionally, prompt processing seems much faster, but it is still slower than full CPU inference. However, proper use of zero-copy buffers according to the Intel documentation indicates that
Sources:
Build:
Results:
- CPU backend
- CL_MEM_READ_WRITE
- CL_MEM_READ_WRITE
- CL_MEM_ALLOC_HOST_PTR
- CL_MEM_ALLOC_HOST_PTR
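To illustrate the difference (a minimal sketch only, not the actual ggml-opencl code; the `upload_zero_copy` helper and the surrounding context/queue setup are made up for the example): a buffer created with CL_MEM_ALLOC_HOST_PTR can be mapped and filled in place, which on an iGPU sharing memory with the CPU lets the driver avoid a separate device copy, whereas a plain CL_MEM_READ_WRITE buffer is normally written with an explicit transfer.

```c
// Sketch of the flag change discussed above, not the actual ggml-opencl code.
// On an iGPU sharing memory with the CPU, CL_MEM_ALLOC_HOST_PTR lets the driver
// place the buffer in host-visible memory so it can be mapped instead of copied.
#include <CL/cl.h>
#include <string.h>

/* `ctx` and `queue` are assumed to already target the integrated GPU. */
static cl_int upload_zero_copy(cl_context ctx, cl_command_queue queue,
                               const float *src, size_t n, cl_mem *out) {
    cl_int err;

    /* Host-allocatable buffer: a zero-copy candidate on an integrated GPU. */
    *out = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          n * sizeof(float), NULL, &err);
    if (err != CL_SUCCESS) return err;

    /* Map and fill in place instead of clEnqueueWriteBuffer (an explicit copy). */
    void *ptr = clEnqueueMapBuffer(queue, *out, CL_TRUE, CL_MAP_WRITE, 0,
                                   n * sizeof(float), 0, NULL, NULL, &err);
    if (err != CL_SUCCESS) return err;
    memcpy(ptr, src, n * sizeof(float));
    return clEnqueueUnmapMemObject(queue, *out, ptr, 0, NULL, NULL);
}
```

On a discrete GPU the same flag can instead leave the buffer in slow-to-access host memory, which is presumably why it isn't a universally safe default.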
Beta Was this translation helpful? Give feedback.
-
Hello,
llama.cpp compiled with CLBlast gives very poor performance on my system when I store layers in VRAM.
Any idea why? How many layers am I supposed to store in VRAM?

My config:
- llama.cpp compiled with `make LLAMA_CLBLAST=1`.
- Using `amdgpu-install --opencl=rocr`, I've managed to install AMD's proprietary OpenCL on this laptop.

When I run `./main -m model/path`, text generation is relatively fast. When I run `./main -m model/path -ngl 35`, text generation is very slow. I tried various values for the `-ngl` argument, but it is always very slow. Installing or removing the `mesa-opencl-icd` package did not improve the performance.

Is CLBlast really supposed to make llama.cpp faster? What could explain it being slower on my system? Is it because of the integrated GPU? What is your experience with CLBlast?
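One thing worth checking before drawing conclusions (a sketch: `clinfo` needs to be installed, and the `GGML_OPENCL_PLATFORM`/`GGML_OPENCL_DEVICE` environment variables only help if your llama.cpp build reads them, with matching behaviour that may differ between versions) is which OpenCL platform and device the CLBlast build actually ends up on, since ROCm and Mesa can both expose the iGPU:

```sh
# List the platforms/devices the OpenCL ICD loader can see (ROCm vs. Mesa).
clinfo -l

# Pin the backend to the intended platform/device before timing -ngl runs.
GGML_OPENCL_PLATFORM=AMD GGML_OPENCL_DEVICE=0 ./main -m model/path -ngl 35
```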