Want to run on 1 GPU (of 2 total) but looks like the model is loaded onto both GPUs. #2752
-
Using the CuBLAS build with 2 GPUs. I want to load the model onto a single GPU, but it is always loaded into the memory of both GPUs, even when only one GPU is actually used to run it. I've tried a few things without success.
When I run a query I can tell from the power usage that only GPU 0 is being used, but nvidia-smi shows the model loaded on both GPUs: if the model needs 8 GB, both GPUs report 8 GB of memory in use. Any idea what's going on here?
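For reference, this is roughly how I'm comparing memory across the two GPUs (the query fields are standard nvidia-smi options; plain nvidia-smi shows the same numbers):

```sh
# Show per-GPU memory use so the duplicate allocation is easy to spot
# (fields come from nvidia-smi's --query-gpu list; adjust as needed).
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```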
-
@isaacmorgan,
-
When I try
-
Did you start it with CUDA_VISIBLE_DEVICES=0 ./build/bin/main ?
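For example, something like this (the model path and prompt are placeholders, and --main-gpu / --tensor-split are only relevant if your build was compiled with multi-GPU support):

```sh
# Make only GPU 0 visible to the process; llama.cpp should then never
# touch the second GPU (model path and prompt are placeholders).
CUDA_VISIBLE_DEVICES=0 ./build/bin/main -m ./models/7B/model.bin -ngl 35 -p "Hello"

# Alternatively, with both GPUs visible, the multi-GPU flags can force
# everything onto GPU 0 (if your build supports them):
# ./build/bin/main -m ./models/7B/model.bin -ngl 35 --main-gpu 0 --tensor-split 1,0
```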
-
I found the cause of this. It is not a problem with LlamaCPP.
The X11 config had BaseMosaic enabled, which caused this behavior (I don't fully understand why).
https://forums.developer.nvidia.com/t/unwanted-duplicate-threads-processes-on-dual-p6000/155178/3
https://forums.developer.nvidia.com/t/memory-is-allocated-on-all-gpus/183110
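In case anyone else runs into this, a quick way to check is something like the following (the config path is an assumption; on some systems the option lives in a file under /etc/X11/xorg.conf.d/ instead):

```sh
# Look for the BaseMosaic option in the X11 config (path is an assumption;
# it may be in a file under /etc/X11/xorg.conf.d/ on other setups).
grep -in "BaseMosaic" /etc/X11/xorg.conf

# If it shows something like:  Option "BaseMosaic" "on"
# then disabling that option and restarting X should stop the duplicate
# allocation, per the threads linked above.
```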