Is there any indication at compile-time or runtime whether cuBLAS or MMQ will be used? #8340
Replies: 5 comments 3 replies
-
Are you on Linux?
-
@isaac-mcfadyen No, you are correct: it will only use the GPU if you set -ngl to the number of layers your model has, so ~35 for a 7B model. At runtime llama-cli prints a log where you should see something like this (I am offloading 25 layers in this example): llm_load_tensors: offloading 25 repeating layers to GPU
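For illustration, a minimal run that should produce that line; the model path and prompt are placeholders, and the exact log wording may differ between versions:

```sh
# Offload 25 layers of a (placeholder) 7B model to the GPU.
./llama-cli -m ./models/llama-7b.Q4_K_M.gguf -ngl 25 -p "Hello"

# Expected in the startup log (quoted from the reply above):
#   llm_load_tensors: offloading 25 repeating layers to GPU
```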
-
I see, in my case they are both set to no :)
-
Opened an issue at #8350 :)
-
I cannot think of anything useful except for Nsight Compute, but it is a pain: https://developer.nvidia.com/nsight-compute
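As a rough sketch of that approach (the model path, prompt, and report name are placeholders, and the exact CUDA kernel names depend on your llama.cpp build): profile a short run with Nsight Compute and check whether cuBLAS library kernels or llama.cpp's own MMQ kernels are being launched.

```sh
# Sketch only: profile a short llama-cli run with Nsight Compute (ncu),
# limiting the number of profiled kernel launches to keep it quick.
ncu --launch-count 20 -o llama-kernels \
    ./llama-cli -m ./models/llama-7b.Q4_K_M.gguf -ngl 99 -n 16 -p "test"

# Then open llama-kernels.ncu-rep in the Nsight Compute UI (or run
# `ncu --import llama-kernels.ncu-rep --page raw`) and look at which
# matrix-multiplication kernels were launched.
```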
-
As far as I know, llama.cpp picks between cuBLAS and MMQ based on int8 tensor core support on the GPU. There are optional compile flags to force either MMQ or cuBLAS, but I haven't specified those flags and I can't find any indication of which path is actually being used.
Is this a case of me just missing the location where it's reported, or is it not indicated at all? And if it's not indicated, would this be a useful feature to add (if so, I'll create an issue)?
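For reference, a sketch of forcing one path at build time. The option names below are my assumption of what the CMake flags are called at the time of writing (older builds used a LLAMA_CUDA_* prefix); check the options listed in your checkout's CMake files before relying on them.

```sh
# Assumed CMake option names; verify against your llama.cpp checkout.

# Force the MMQ (custom quantized matmul) kernels:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release

# Or force cuBLAS for the matrix multiplications:
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON
cmake --build build --config Release
```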