Want to run on 1 GPU (of 2 total) but looks like the model is loaded onto both GPUs. #2752
-
Using the CuBLAS build with 2 GPUs. I want to load the model onto a single GPU, but it is always loaded into the memory of both GPUs, even when only one GPU is actually used to run it. I've tried a few things without success.
When I run a query I can tell from the power usage that only GPU 0 is being used, but nvidia-smi shows the model loaded on both GPUs: if the model needs 8 GB, both GPUs report 8 GB of memory in use. Any idea what's going on here?
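For reference, this is roughly how I'm comparing memory across the two GPUs (the query fields are standard nvidia-smi options; plain nvidia-smi shows the same numbers):

```sh
# Show per-GPU memory use so the duplicate allocation is easy to spot
# (fields come from nvidia-smi's --query-gpu list; adjust as needed).
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```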
-
@isaacmorgan,
-
When I try
-
Did you start it with CUDA_VISIBLE_DEVICES=0 ./build/bin/main ?
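For example, something like this (the model path and prompt are placeholders, and --main-gpu / --tensor-split are only relevant if your build was compiled with multi-GPU support):

```sh
# Make only GPU 0 visible to the process; llama.cpp should then never
# touch the second GPU (model path and prompt are placeholders).
CUDA_VISIBLE_DEVICES=0 ./build/bin/main -m ./models/7B/model.bin -ngl 35 -p "Hello"

# Alternatively, with both GPUs visible, the multi-GPU flags can force
# everything onto GPU 0 (if your build supports them):
# ./build/bin/main -m ./models/7B/model.bin -ngl 35 --main-gpu 0 --tensor-split 1,0
```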
-
I found the cause of this. It is not a problem with LlamaCPP.
The X11 config had BaseMosaic enabled, which caused this behavior (I don't fully understand why).
https://forums.developer.nvidia.com/t/unwanted-duplicate-threads-processes-on-dual-p6000/155178/3
https://forums.developer.nvidia.com/t/memory-is-allocated-on-all-gpus/183110
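In case anyone else runs into this, a quick way to check is something like the following (the config path is an assumption; on some systems the option lives in a file under /etc/X11/xorg.conf.d/ instead):

```sh
# Look for the BaseMosaic option in the X11 config (path is an assumption;
# it may be in a file under /etc/X11/xorg.conf.d/ on other setups).
grep -in "BaseMosaic" /etc/X11/xorg.conf

# If it shows something like:  Option "BaseMosaic" "on"
# then disabling that option and restarting X should stop the duplicate
# allocation, per the threads linked above.
```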