Is there any indication at compile-time or runtime whether cuBLAS or MMQ will be used? #8340
Replies: 5 comments 3 replies
-
Are you on Linux?
-
@isaac-mcfadyen No, you are correct: it will only use the GPU if you set -ngl to the number of layers your model has, so ~35 for a 7B model. At runtime llama-cli prints a log where you should see something like this (I am offloading 25 layers in this example): llm_load_tensors: offloading 25 repeating layers to GPU
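For illustration, a minimal run that should produce that line; the model path and prompt are placeholders, and the exact log wording may differ between versions:

```sh
# Offload 25 layers of a (placeholder) 7B model to the GPU.
./llama-cli -m ./models/llama-7b.Q4_K_M.gguf -ngl 25 -p "Hello"

# Expected in the startup log (quoted from the reply above):
#   llm_load_tensors: offloading 25 repeating layers to GPU
```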
-
I see, in my case they are both set to no :)
-
Opened an issue at #8350 :)
-
I cannot think of anything useful except for Nsight Compute, but it is a pain: https://developer.nvidia.com/nsight-compute
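As a rough sketch of that approach (the model path, prompt, and report name are placeholders, and the exact CUDA kernel names depend on your llama.cpp build): profile a short run with Nsight Compute and check whether cuBLAS library kernels or llama.cpp's own MMQ kernels are being launched.

```sh
# Sketch only: profile a short llama-cli run with Nsight Compute (ncu),
# limiting the number of profiled kernel launches to keep it quick.
ncu --launch-count 20 -o llama-kernels \
    ./llama-cli -m ./models/llama-7b.Q4_K_M.gguf -ngl 99 -n 16 -p "test"

# Then open llama-kernels.ncu-rep in the Nsight Compute UI (or run
# `ncu --import llama-kernels.ncu-rep --page raw`) and look at which
# matrix-multiplication kernels were launched.
```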
-
As far as I know, llama.cpp picks between cuBLAS and MMQ based on int8 tensor core support on the GPU. There are optional compile flags to force either MMQ or cuBLAS, but I haven't specified those flags and I can't find any indication of which path is actually being used.
Is this a case of me just missing the location where it's reported, or is it not indicated at all? And if it's not indicated, would this be a useful feature to add (if so, I'll create an issue)?
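For reference, a sketch of forcing one path at build time. The option names below are my assumption of what the CMake flags are called at the time of writing (older builds used a LLAMA_CUDA_* prefix); check the options listed in your checkout's CMake files before relying on them.

```sh
# Assumed CMake option names; verify against your llama.cpp checkout.

# Force the MMQ (custom quantized matmul) kernels:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release

# Or force cuBLAS for the matrix multiplications:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON
cmake --build build --config Release
```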