How to properly use llama.cpp with multiple NVIDIA GPUs with different CUDA compute engine versions? #8725

tigran123 · 2024-07-27T14:00:15Z

tigran123
Jul 27, 2024

I have an RTX 2080 Ti 11GB and TESLA P40 24GB in my machine. First of all, when I try to compile llama.cpp I am asked to set CUDA_DOCKER_ARCH accordingly. But according to what -- RTX 2080 Ti (7.5) or P40 (6.1)? When it was compiled with CUDA_DOCKER_ARCH=compute_75 it would fail to load the model. Now I compiled it with CUDA_DOCKER_ARCH=compute_61 and it seems to load, but does it mean it is now much slower than it could be, because it is using an older compute engine version on the RTX 2080 Ti?

UPDATE: it only "seems to load" if the value of -ngl N is low enough to fit into the first GPU (RTX 2080 Ti). With higher values it fails.
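A possible way to choose the value, offered as a sketch rather than an official recipe: the Makefile error quoted further down this thread describes XX as the minimum compute capability the code needs to run on, so for a 7.5 card and a 6.1 card the lower one would be used. The helper name `min_cuda_arch` is hypothetical and not part of llama.cpp, and `sort -V` assumes GNU coreutils.

```shell
# Hypothetical helper (not part of llama.cpp): given compute capabilities
# such as 7.5 and 6.1, print compute_XX for the lowest one.
min_cuda_arch() {
    printf '%s\n' "$@" | sort -V | head -n 1 | tr -d '.' | sed 's/^/compute_/'
}

# RTX 2080 Ti is 7.5 and Tesla P40 is 6.1, as stated in the question.
min_cuda_arch 7.5 6.1   # prints compute_61
```

If the installed nvidia-smi supports the `compute_cap` query field (an assumption; older drivers may not), the capabilities could be supplied with `min_cuda_arch $(nvidia-smi --query-gpu=compute_cap --format=csv,noheader)`.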

Answered by dspasyuk

Jul 31, 2024

@tigran123 Open the Additional Drivers settings dialog and install any non-open (proprietary) driver above 525.


tigran123 · 2024-07-29T13:35:00Z

tigran123
Jul 29, 2024
Author

If I load Meta-Llama-3-8B-Instruct.f16.gguf it works fine and seems to use both GPUs. But trying to load Meta-Llama-3-70B-Instruct.f16.gguf results in this crash:

$ ~/Software/AI/llama.cpp/llama-server --host 192.168.1.4 -m /data/models/Meta-Llama-3-70B-Instruct.f16.gguf -t 12 -ngl 16
....
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  2048.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   352.00 MiB
llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  2270.50 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =  1104.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    32.01 MiB
llama_new_context_with_model: graph nodes  = 2566
llama_new_context_with_model: graph splits = 709
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
  current device: 0, in function cublas_handle at ggml/src/ggml-cuda/common.cuh:826
  cublasCreate_v2(&cublas_handles[device])
ggml/src/ggml-cuda.cu:101: CUDA error
[New LWP 1493]
[New LWP 1494]
[New LWP 1495]
[New LWP 1496]
[New LWP 1500]
[New LWP 1501]
[New LWP 1502]
[New LWP 1503]
[New LWP 1504]
[New LWP 1505]
[New LWP 1506]
[New LWP 1507]
[New LWP 1508]
[New LWP 1509]
[New LWP 1510]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f77abeea42f in __GI___wait4 (pid=1511, stat_loc=stat_loc@entry=0x7ffd079d49f4, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f77abeea42f in __GI___wait4 (pid=1511, stat_loc=stat_loc@entry=0x7ffd079d49f4, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f77abeea3ab in __GI___waitpid (pid=<optimised out>, stat_loc=stat_loc@entry=0x7ffd079d49f4, options=options@entry=0) at ./posix/waitpid.c:38
38	./posix/waitpid.c: No such file or directory.
#2  0x000055a63de5918a in ggml_print_backtrace () at ggml/src/ggml.c:179
179	        waitpid(pid, &wstatus, 0);
#3  ggml_abort (file=0x55a63e09f867 "ggml/src/ggml-cuda.cu", line=101, fmt=0x55a63e0968bf "CUDA error") at ggml/src/ggml.c:206
206	    ggml_print_backtrace();
#4  0x000055a63dc75ec6 in ggml_cuda_error (stmt=stmt@entry=0x55a63e097790 "cublasCreate_v2(&cublas_handles[device])", func=func@entry=0x55a63e096aaf "cublas_handle", file=file@entry=0x55a63e096917 "ggml/src/ggml-cuda/common.cuh", line=line@entry=826, msg=0x55a63e096796 "CUBLAS_STATUS_NOT_INITIALIZED") at ggml/src/ggml-cuda.cu:101
101	    GGML_ABORT("CUDA error");
#5  0x000055a63dc79561 in ggml_backend_cuda_context::cublas_handle (device=0, this=0x55a63f6ec230) at ggml/src/ggml-cuda/common.cuh:826
826	            CUBLAS_CHECK(cublasCreate(&cublas_handles[device]));
#6  ggml_cuda_op_mul_mat_cublas (ctx=..., src0=src0@entry=0x55a64a866b00, src1=src1@entry=0x55a63fabdb20, dst=dst@entry=0x55a63fabdc90, src0_dd_i=src0_dd_i@entry=0x7f546c008000 "", src1_ddf_i=src1_ddf_i@entry=0x7f4eda010000, src1_ddq_i=0x0, dst_dd_i=0x7f4eda021080, row_low=0, row_high=8192, src1_ncols=2, src1_padded_row_size=8192, stream=0x55a64a679640) at ggml/src/ggml-cuda.cu:1240
1240	        CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream));
#7  0x000055a63dc7e28a in ggml_cuda_op_mul_mat (ctx=..., src0=0x55a64a866b00, src1=0x55a63fabdb20, dst=0x55a63fabdc90, op=0x55a63dc78880 <ggml_cuda_op_mul_mat_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, int64_t, int64_t, int64_t, int64_t, cudaStream_t)>, quantize_src1=<optimised out>) at ggml/src/ggml-cuda.cu:1612
1612	                op(ctx, src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i,
#8  0x000055a63dc813ec in ggml_cuda_mul_mat (ctx=..., src0=0x55a64a866b00, src1=0x55a63fabdb20, dst=dst@entry=0x55a63fabdc90) at ggml/src/ggml-cuda.cu:1945
1945	        ggml_cuda_op_mul_mat(ctx, src0, src1, dst, ggml_cuda_op_mul_mat_cublas, nullptr);
#9  0x000055a63dc832bd in ggml_cuda_compute_forward (dst=0x55a63fabdc90, ctx=...) at ggml/src/ggml-cuda.cu:2237
2237	                ggml_cuda_mul_mat(ctx, dst->src[0], dst->src[1], dst);
#10 ggml_backend_cuda_graph_compute (backend=<optimised out>, cgraph=0x55a63f8fa458) at ggml/src/ggml-cuda.cu:2598
2598	                bool ok = ggml_cuda_compute_forward(*cuda_ctx, node);
#11 0x000055a63dea2255 in ggml_backend_sched_compute_splits (sched=0x55a63f625a90) at ggml/src/ggml-backend.c:1790
1790	            enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#12 ggml_backend_sched_graph_compute_async (sched=0x55a63f625a90, graph=<optimised out>) at ggml/src/ggml-backend.c:1977
1977	    return ggml_backend_sched_compute_splits(sched);
#13 0x000055a63def60b0 in llama_graph_compute (n_threads=12, gf=0x55a63f9c4aa0, lctx=...) at src/llama.cpp:14421
14421	    ggml_backend_sched_graph_compute_async(lctx.sched, gf);
#14 llama_decode_internal (batch_all=..., batch_all=..., lctx=...) at src/llama.cpp:14634
14634	        llama_graph_compute(lctx, gf, n_threads);
#15 llama_decode (ctx=0x55a63f6eab50, batch=...) at src/llama.cpp:18416
18416	    const int ret = llama_decode_internal(*ctx, batch);
#16 0x000055a63dfc8905 in llama_init_from_gpt_params (params=...) at common/common.cpp:2132
2132	        llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
#17 0x000055a63e06c239 in server_context::load_model (this=this@entry=0x7ffd079d6f40, params_=...) at examples/server/server.cpp:680
680	        std::tie(model, ctx) = llama_init_from_gpt_params(params);
#18 0x000055a63dc60c44 in main (argc=<optimised out>, argv=<optimised out>) at examples/server/server.cpp:2612
2612	    if (!ctx_server.load_model(params)) {
[Inferior 1 (process 1489) detached]
Aborted

1 reply

dspasyuk Jul 29, 2024

Try building for all architectures. I have P4s in my machine and I have no issues. Also, I would not use F16 models; Q5-Q6 at most.

git clone https://github.com/ggerganov/llama.cpp.git; cd llama.cpp; sed -i 's/-arch=native/-arch=all/g' Makefile; make clean && LLAMA_CUDA=1 make -j 6;

tigran123 · 2024-07-29T15:16:29Z

tigran123
Jul 29, 2024
Author

Hmmm, the -march=native has to do with the CPU architecture and not with the CUDA compute engine versions of the GPUs as far as I remember. Are you sure that this will solve the problem? I mean, of course I can try, but I highly doubt this as it seems irrelevant. Is there no way to specify multiple compute engines via CUDA_DOCKER_ARCH environment variable?

1 reply

dspasyuk Jul 29, 2024

I am not using Docker, but I would assume you can set CUDA_DOCKER_ARCH to all.
In the sed instruction above it is -arch, not -march; it changes MK_NVCCFLAGS:
ifdef CUDA_DOCKER_ARCH
MK_NVCCFLAGS += -Wno-deprecated-gpu-targets -arch=$(CUDA_DOCKER_ARCH)
else ifndef CUDA_POWER_ARCH
MK_NVCCFLAGS += -arch=native
endif # CUDA_DOCKER_ARCH

tigran123 · 2024-07-29T15:30:44Z

tigran123
Jul 29, 2024
Author

Oh, I am certainly not using docker either and I assumed that the variable was just badly misnamed and has no connection with docker :) So, if I don't set CUDA_DOCKER_ARCH variable then I get this error:

$ unset CUDA_DOCKER_ARCH
$ make -j12 LLAMA_CUDA=1
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS  -std=c11   -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=all -mtune=native -fopenmp -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -g -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -fopenmp  -march=all -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -Iggml/include -Iggml/src -Iinclude -Isrc -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_OPENMP -DGGML_USE_LLAMAFILE -DGGML_USE_CUDA -I/usr/local/cuda/include -I/usr/local/cuda/targets/x86_64-linux/include -DGGML_CUDA_USE_GRAPHS 
I NVCCFLAGS: -std=c++11 -O3 -g -use_fast_math --forward-unknown-to-host-compiler -arch=all -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 
I LDFLAGS:   -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/usr/lib64 -L/usr/local/cuda/targets/x86_64-linux/lib -L/usr/local/cuda/lib64/stubs -L/usr/lib/wsl/lib 
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I NVCC:      Build cuda_11.5.r11.5/compiler.30672275_0
Makefile:951: *** I ERROR: For CUDA versions < 11.7 a target CUDA architecture must be explicitly provided via environment variable CUDA_DOCKER_ARCH, e.g. by running "export CUDA_DOCKER_ARCH=compute_XX" on Unix-like systems, where XX is the minimum compute capability that the code needs to run on. A list with compute capabilities can be found here: https://developer.nvidia.com/cuda-gpus . Stop.

Here is my nvcc version:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
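The Makefile error above depends on the toolkit release being below 11.7. A minimal sketch of checking that from the `nvcc --version` text follows. The function names are hypothetical, the sample line is copied from the output above, and `sort -V` assumes GNU coreutils.

```shell
# Hypothetical helpers for comparing the installed nvcc release to 11.7.

# Read `nvcc --version` output on stdin and print the release, e.g. 11.5.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Sample input taken from the nvcc output shown in this post.
release=$(echo 'Cuda compilation tools, release 11.5, V11.5.119' | nvcc_release)
if version_ge "$release" 11.7; then
    echo "nvcc $release: the Makefile default (-arch=native) applies"
else
    echo "nvcc $release: set CUDA_DOCKER_ARCH explicitly"
fi
```

On a real system the sample line would be replaced by `nvcc --version | nvcc_release`.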

1 reply

dspasyuk Jul 29, 2024

oh, that explains things, upgrade to cuda 12+

tigran123 · 2024-07-29T15:43:13Z

tigran123
Jul 29, 2024
Author

Oh dear, the version of nvidia-cuda-toolkit in Ubuntu 22.04 is 11.5. Does this mean I have to install one manually instead of using what comes with Ubuntu 22.04?

1 reply

dspasyuk Jul 29, 2024

You should have version 12.5; I have it in my apt:
sudo apt-cache search libcudnn

nvidia-cudnn - NVIDIA CUDA Deep Neural Network library (install script)
libcudnn8 - cuDNN runtime libraries
libcudnn8-dev - cuDNN development libraries and headers
libcudnn8-samples - cuDNN samples
libcudnn9-cuda-11 - cuDNN runtime libraries for CUDA 11.8
libcudnn9-cuda-12 - cuDNN runtime libraries for CUDA 12.5
libcudnn9-dev-cuda-11 - cuDNN development headers and symlinks for CUDA 11.8
libcudnn9-dev-cuda-12 - cuDNN development headers and symlinks for CUDA 12.5
libcudnn9-samples - cuDNN samples
libcudnn9-static-cuda-11 - cuDNN static libraries for CUDA 11.8
libcudnn9-static-cuda-12 - cuDNN static libraries for CUDA 12.5

tigran123 · 2024-07-29T16:35:46Z

tigran123
Jul 29, 2024
Author

Oh, no, that means removing the nvidia kernel driver and installing cuda_12.5.1_555.42.06_linux.run manually, right?

1 reply

dspasyuk Jul 29, 2024

Try these:

sudo apt update && sudo apt upgrade
sudo apt autoremove nvidia* --purge
sudo apt install nvidia-driver-545
reboot
nvidia-smi
sudo apt update && sudo apt upgrade
sudo apt install nvidia-cuda-toolkit
nvcc --version

tigran123 · 2024-07-29T17:06:21Z

tigran123
Jul 29, 2024
Author

I tried to install CUDA 12.5.1 manually, but it failed. Of course I removed nvidia-driver-525, disabled the nouveau driver and rebooted, but it still failed due to a compiler problem: cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'

So I will now try your suggestion instead. Thank you.

0 replies

tigran123 · 2024-07-29T17:16:03Z

tigran123
Jul 29, 2024
Author

Oh no, even after installing nvidia-driver-545 and doing sudo apt update ; sudo apt upgrade still the latest version of nvidia-cuda-toolkit available for my system is 11.5.1-1ubuntu1:

$ apt show nvidia-cuda-toolkit
Package: nvidia-cuda-toolkit
Version: 11.5.1-1ubuntu1
Priority: extra
Section: multiverse/devel
Origin: Ubuntu
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian NVIDIA Maintainers <pkg-nvidia-devel@lists.alioth.debian.org>
Bugs: https://bugs.launchpad.net/ubuntu/+filebug
Installed-Size: 126 MB
Depends: nvidia-profiler (= 11.5.114~11.5.1-1ubuntu1), nvidia-cuda-dev (= 11.5.1-1ubuntu1), nvidia-opencl-dev (= 11.5.1-1ubuntu1) | opencl-dev, g++-11 | g++-11 | clang-12 | g++-10 | clang-11 | clang-10 | g++-9 | clang-9 | g++-8 | clang-8 | clang-7 | g++-7 | clang-6.0 | clang (<< 1:13~) | g++-6, gcc-11 | gcc-11 | clang-12 | gcc-10 | clang-11 | clang-10 | gcc-9 | clang-9 | gcc-8 | clang-8 | clang-7 | gcc-7 | clang-6.0 | clang (<< 1:13~) | gcc-6, libc6 (>= 2.14), libgcc-s1 (>= 4.2), libstdc++6 (>= 4.6)
Recommends: nvidia-cuda-toolkit-doc (= 11.5.1-1ubuntu1), nvidia-cuda-gdb (= 11.5.114~11.5.1-1ubuntu1), nvidia-visual-profiler (= 11.5.114~11.5.1-1ubuntu1), nsight-compute (= 2021.3.1.4~11.5.1-1ubuntu1), nsight-systems (= 2021.3.3.2~11.5.1-1ubuntu1)
Breaks: nvidia-cuda-doc (<< 10.2.89-3)
Homepage: https://developer.nvidia.com/cuda-zone
Download-Size: 62.8 MB
APT-Sources: http://gb.archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages
Description: NVIDIA CUDA development toolkit
 The Compute Unified Device Architecture (CUDA) enables NVIDIA graphics
 processing units (GPUs) to be used for massively parallel general purpose
 computation.
 .
 This package contains the nvcc compiler and other tools needed for
 building CUDA applications.
 .
 Running CUDA applications requires a supported NVIDIA GPU and the NVIDIA
 driver kernel module.

What can I do? Investigate why the manual installation of cuda-12.5.1 fails with that compiler issue? Or is there a better way? I had no idea that upgrading the CUDA Toolkit on a relatively recent OS (Ubuntu 22.04.4) is such a nightmare...

3 replies

dspasyuk Jul 29, 2024

Hmm, strange, I literally did that yesterday on Ubuntu 22.04.4. You should be able to purge all the nvidia stuff and reinstall it. Do you have cuda-cudart-12-3 and libcublas-12-3? This is what I have:

sudo apt-cache search cudar

libcudart11.0 - NVIDIA CUDA Runtime Library
cuda-cudart-11-7 - CUDA Runtime native Libraries
cuda-cudart-11-8 - CUDA Runtime native Libraries
cuda-cudart-12-0 - CUDA Runtime native Libraries
cuda-cudart-12-1 - CUDA Runtime native Libraries
cuda-cudart-12-2 - CUDA Runtime native Libraries
cuda-cudart-12-3 - CUDA Runtime native Libraries
cuda-cudart-12-4 - CUDA Runtime native Libraries
cuda-cudart-12-5 - CUDA Runtime native Libraries
cuda-cudart-dev-11-7 - CUDA Runtime native dev links, headers
cuda-cudart-dev-11-8 - CUDA Runtime native dev links, headers
cuda-cudart-dev-12-0 - CUDA Runtime native dev links, headers
cuda-cudart-dev-12-1 - CUDA Runtime native dev links, headers
cuda-cudart-dev-12-2 - CUDA Runtime native dev links, headers
cuda-cudart-dev-12-3 - CUDA Runtime native dev links, headers
cuda-cudart-dev-12-4 - CUDA Runtime native dev links, headers
cuda-cudart-dev-12-5 - CUDA Runtime native dev links, headers

tigran123 Jul 29, 2024
Author

No, I only have 11.0:

$ sudo apt-cache search cudar
libcudart11.0 - NVIDIA CUDA Runtime Library

dspasyuk Jul 29, 2024

what is your output of lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy

tigran123 · 2024-07-29T17:37:34Z

tigran123
Jul 29, 2024
Author

But the strange thing is that on that machine the kernel is stuck at the version 6.5.0-17 for some reason, but on all my other Ubuntu 22.04.4 machines the kernel is version 6.5.0-41. But still even on the other machines the nvidia-cuda-toolkit goes only up to 11.5

My lsb_release -a output is:

$ sudo lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy

0 replies

tigran123 · 2024-07-29T17:39:03Z

tigran123
Jul 29, 2024
Author

Maybe I should try installing nvidia-driver-555 manually and then install cuda-12.5.1 with the driver option disabled?

Ah, no the driver version only goes up to 550, but cuda 12.5.1 requires 555.

0 replies

dspasyuk · 2024-07-29T17:43:23Z

dspasyuk
Jul 29, 2024

try using the official installation guide from Nvidia: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

you have to purge first: sudo apt autoremove nvidia* --purge

0 replies

tigran123 · 2024-07-29T18:03:11Z

tigran123
Jul 29, 2024
Author

Ok, done all that successfully, now I have 555 driver:

$ dmesg | grep "NVIDIA UNIX Open"
[    9.545609] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  555.42.06  Release Build  (dvs-builder@U16-I3-A13-3-4)  Tue Jun  4 00:45:31 UTC 2024
[    9.574240] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  555.42.06  Release Build  (dvs-builder@U16-I3-A13-3-4)  Tue Jun  4 00:29:54 UTC 2024

All other installation steps completed as well. However, I do not have nvcc at all. And if I type nvcc I get a suggestion to install nvidia-cuda-toolkit, but it is still the wrong version:

$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit
$ apt list | grep nvidia-cuda-toolkit

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

nvidia-cuda-toolkit-doc/jammy,jammy 11.5.1-1ubuntu1 all
nvidia-cuda-toolkit-gcc/jammy 11.5.1-1ubuntu1 amd64
nvidia-cuda-toolkit/jammy 11.5.1-1ubuntu1 amd64

0 replies

dspasyuk · 2024-07-29T18:11:33Z

dspasyuk
Jul 29, 2024

Did you use the Nvidia official setup?

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.5.1/local_installers/cuda-repo-ubuntu2204-12-5-local_12.5.1-555.42.06-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-5-local_12.5.1-555.42.06-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-5-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-5

2 replies

tigran123 Jul 29, 2024
Author

cuda-toolkit-12-5 is already the newest version (12.5.1-1).

but obviously nvcc is not part of it.

tigran123 Jul 29, 2024
Author

And yes, I have done all the steps as per those instructions. And all succeeded.

tigran123 · 2024-07-29T18:24:46Z

tigran123
Jul 29, 2024
Author

$ sudo apt install cuda-nvcc-12-5
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cuda-nvcc-12-5 is already the newest version (12.5.82-1).
0 to upgrade, 0 to newly install, 0 to remove and 2 not to upgrade.
$ nvcc
Command 'nvcc' not found, but can be installed with:
sudo apt install nvidia-cuda-toolkit

4 replies

dspasyuk Jul 29, 2024

It could just be that the environment was not set up properly.
Do you have nvcc in /usr/bin/nvcc or /usr/local/cuda-12.5/bin/nvcc?

tigran123 Jul 29, 2024
Author

Ah, bingo:

$ /usr/local/cuda-12.5/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

Ok, so I shall add it to the PATH. Is there anything else I should set, like LD_LIBRARY_PATH or the like?
Thank you very much for all your help! I will recompile llama.cpp and see the difference! :)

dspasyuk Jul 29, 2024

Nice! Yes, put these in .bashrc
export PATH="/usr/local/cuda-12.5/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-12.5/lib64:$LD_LIBRARY_PATH"

and source it: source ~/.bashrc
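A slightly more defensive variant of the two export lines above, as a sketch only. It avoids adding the toolkit directory to PATH twice when .bashrc is sourced repeatedly, and it avoids leaving a trailing colon in LD_LIBRARY_PATH when that variable starts out empty (the dynamic loader treats an empty entry as the current directory). The variable name `cuda_dir` is an arbitrary choice for this example.

```shell
# Assumption: the toolkit lives where nvcc was found earlier in this thread.
cuda_dir=/usr/local/cuda-12.5

# Prepend the bin directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$cuda_dir/bin:"*) ;;
    *) export PATH="$cuda_dir/bin:$PATH" ;;
esac

# Append the previous value only when it is non-empty, so no stray ':' is left.
export LD_LIBRARY_PATH="$cuda_dir/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

After sourcing, `command -v nvcc` should print the path under /usr/local/cuda-12.5/bin if the toolkit is installed there.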

dspasyuk Jul 29, 2024

The installation was not that painful for me; I am not sure what the difference is.

tigran123 · 2024-07-31T15:22:19Z

tigran123
Jul 31, 2024
Author

Oh dear, the problem is now worse, the P40 GPU is gone! The 555 driver does not see it!

$ nvidia-smi 
Wed Jul 31 16:21:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   41C    P8             18W /  250W |       1MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

0 replies

tigran123 · 2024-07-31T15:25:13Z

tigran123
Jul 31, 2024
Author

Here is what I see in dmesg:

[    9.633162] NVRM: The NVIDIA GPU 0000:02:00.0 (PCI ID: 10de:1b38)
               NVRM: installed in this system is not supported by open
               NVRM: nvidia.ko because it does not include the required GPU
               NVRM: System Processor (GSP).
               NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
               NVRM: Firmware' sections in the driver README, available on
               NVRM: the Linux graphics driver download page at
               NVRM: www.nvidia.com.

$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  555.42.06  Release Build  (dvs-builder@U16-I3-A13-3-4)  Tue Jun  4 00:45:31 UTC 2024
GCC version:  gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
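The dmesg text above says the open nvidia.ko does not support the GPU at 0000:02:00.0 because it lacks a GSP, and the version file reports "Open Kernel Module". A small sketch for checking which flavor is loaded follows. The function name is hypothetical, and treating any version string without that phrase as the proprietary module is an assumption of this example.

```shell
# Hypothetical helper: read the contents of /proc/driver/nvidia/version on
# stdin and report which kernel module flavor it describes.
nvidia_module_flavor() {
    if grep -q 'Open Kernel Module'; then
        echo open
    else
        # Assumption: no "Open Kernel Module" phrase means the proprietary module.
        echo proprietary
    fi
}

# Only inspect the file when the NVIDIA driver is actually loaded.
version_file=/proc/driver/nvidia/version
if [ -r "$version_file" ]; then
    nvidia_module_flavor < "$version_file"
fi
```

With the version string shown above, the helper would report `open`, which matches the dmesg complaint about the P40.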

2 replies

dspasyuk Jul 31, 2024

@tigran123 Open the Additional Drivers settings dialog and install any non-open (proprietary) driver above 525.

Answer selected by tigran123

tigran123 Jul 31, 2024
Author

Ok, I'll install it via apt as I don't run X11 on that machine normally (I can, of course, but it should be the same). So, I'll choose the highest available driver number.

tigran123 · 2024-07-31T15:42:47Z

tigran123
Jul 31, 2024
Author

Yes! Bravo, Denis!

$ nvidia-smi
Wed Jul 31 16:42:18 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   54C    P0             63W /  250W |       1MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P40                      Off |   00000000:02:00.0 Off |                  Off |
| N/A   58C    P0             57W /  250W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

1 reply

dspasyuk Jul 31, 2024

@tigran123 Yeee! Very nice! Would be nice to see what is your experience with P40 performance using llama.cpp:

For example:
./llama-bench -fa 1 -m ../models/qwen2-7b-instruct-q5_k_m.gguf

tigran123 · 2024-07-31T16:53:45Z

tigran123
Jul 31, 2024
Author

Hmmm, at the moment I am not happy with the performance of 70B F16 original. I think without P40 (just split between RTX 2080 Ti 11GB and CPU 128GB) it was much faster. Also, killing llama-server process via ^C leaves a zombie process wasting 100% of one cpu:

$ ps aux | grep llama
tigran      1239 73.7  0.0      0     0 pts/2    Zl+  17:20  23:43 [llama-server] <defunct>

But I will try smaller quantised versions like Q8 or Q6.

1 reply

dspasyuk Jul 31, 2024

@tigran123 Yes, normally I use q5_k_m models; their perplexity is practically the same as other q5+ models.

tigran123 · 2024-07-31T18:07:07Z

tigran123
Jul 31, 2024
Author

Oh dear, using 8B F16 Llama3 I got this error:

[ 1184.480970] pcieport 0000:00:02.0: AER: Uncorrected (Fatal) error received: 0000:00:02.0
[ 1184.480986] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[ 1184.481228] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00000020/00000000
[ 1184.481266] pcieport 0000:00:02.0:    [ 5] SDES                   (First)
[ 1184.481303] nvidia 0000:02:00.0: AER: can't recover (no error_detected callback)
[ 1185.105293] NVRM: GPU at PCI:0000:02:00: GPU-1a24d9cb-1739-473b-4aae-469f38b0c5ca
[ 1185.105303] NVRM: GPU Board Serial Number: 0324017003805
[ 1185.105306] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 1185.105316] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[ 1185.105319] NVRM: GPU 0000:02:00.0: GPU serial number is 0324017003805.
[ 1185.105329] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
[ 1185.525828] pcieport 0000:00:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 1186.533834] pcieport 0000:00:02.0: retraining failed
[ 1187.789853] pcieport 0000:00:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
[ 1188.797857] pcieport 0000:00:02.0: retraining failed

I hope this is just a one off and not a broken P40 card. Is there a way to properly run some testing diagnostics on it somehow?
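This is not a hardware diagnostic, only a sketch for spotting repeat occurrences: it pulls the Xid codes out of kernel log lines like the ones above, where Xid 79 accompanies "GPU has fallen off the bus". The function name `xid_codes` is hypothetical, and the pattern assumes the NVRM message format shown in this log.

```shell
# Hypothetical helper: read kernel log text on stdin and print each distinct
# NVIDIA Xid code once, in numeric order.
xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p' | sort -un
}

# Example usage on a live system (reading the kernel log may require root):
#   sudo dmesg | xid_codes
```

If the same code keeps appearing under load, that points to a recurring fault rather than a one-off. The log itself also suggests running nvidia-bug-report.sh as root to collect the crash dump.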

1 reply

dspasyuk Jul 31, 2024

interesting, fingers crossed

tigran123 · 2024-08-01T09:58:12Z

tigran123
Aug 1, 2024
Author

Hmmm, yes, the P40 GPU is broken, so I am returning it to the eBay seller. I can consistently reproduce this problem under high load using Blender (via OptiX): it causes these errors in the log, and the blender process hangs, wasting 100% CPU:

[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:02.0
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004000/00000000
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:    [14] CmpltTO                (First)
[Thu Aug  1 07:52:24 2024] nvidia 0000:02:00.0: AER: can't recover (no error_detected callback)
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: AER: device recovery failed
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:02.0
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004000/00000000
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:    [14] CmpltTO                (First)
[Thu Aug  1 07:52:24 2024] nvidia 0000:02:00.0: AER: can't recover (no error_detected callback)
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: AER: device recovery failed
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:02.0
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004000/00000000
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:    [14] CmpltTO                (First)
[Thu Aug  1 07:52:24 2024] nvidia 0000:02:00.0: AER: can't recover (no error_detected callback)
[Thu Aug  1 07:52:24 2024] NVRM: GPU at PCI:0000:02:00: GPU-1a24d9cb-1739-473b-4aae-469f38b0c5ca
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: AER: device recovery failed
[Thu Aug  1 07:52:24 2024] NVRM: GPU Board Serial Number: 0324017003805
[Thu Aug  1 07:52:24 2024] NVRM: Xid (PCI:0000:02:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[Thu Aug  1 07:52:24 2024] NVRM: GPU 0000:02:00.0: GPU has fallen off the bus.
[Thu Aug  1 07:52:24 2024] NVRM: GPU 0000:02:00.0: GPU serial number is 0324017003805.
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: AER: Multiple Uncorrected (Fatal) error received: 0000:00:02.0
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004020/00000000
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:    [ 5] SDES                  
[Thu Aug  1 07:52:24 2024] pcieport 0000:00:02.0:    [14] CmpltTO                (First)
[Thu Aug  1 07:52:24 2024] nvidia 0000:02:00.0: AER: can't recover (no error_detected callback)
[Thu Aug  1 07:52:25 2024] pcieport 0000:00:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
[Thu Aug  1 07:52:26 2024] pcieport 0000:00:02.0: retraining failed
[Thu Aug  1 07:52:28 2024] pcieport 0000:00:02.0: broken device, retraining non-functional downstream link at 2.5GT/s
[Thu Aug  1 07:52:29 2024] pcieport 0000:00:02.0: retraining failed
[Thu Aug  1 07:52:29 2024] nvidia 0000:02:00.0: not ready 1023ms after bus reset; waiting
[Thu Aug  1 07:52:30 2024] nvidia 0000:02:00.0: not ready 2047ms after bus reset; waiting
[Thu Aug  1 07:52:32 2024] nvidia 0000:02:00.0: not ready 4095ms after bus reset; waiting
[Thu Aug  1 07:52:36 2024] nvidia 0000:02:00.0: not ready 8191ms after bus reset; waiting
[Thu Aug  1 07:52:44 2024] nvidia 0000:02:00.0: not ready 16383ms after bus reset; waiting
[Thu Aug  1 07:53:02 2024] nvidia 0000:02:00.0: not ready 32767ms after bus reset; waiting

But all your help was certainly NOT in vain. First of all, I am going to get a P40 GPU replacement most likely. And secondly, even with just RTX 2080 Ti it is much nicer to use CUDA 12.5 than 11.5. Thank you again!

1 reply

dspasyuk Aug 1, 2024

@tigran123 No problem, I am glad you figured it out!
