Apologies if I have missed this topic and it is already covered in the documentation/CRDs. When using an NVIDIA accelerator in `aideployments.premlabs.io`, is it possible to request multiple GPUs?
In a cluster with multiple GPUs:
```
kubectl get node lab-infra-node-3 -o yaml | grep 'nvidia.com/gpu:'
  nvidia.com/gpu: "3"
```
Only a single GPU was requested by the deployment:
```
kubectl -n premai get pods hermes-758f784656-79fbv -o yaml | grep 'nvidia.com/gpu'
  nvidia.com/gpu: "1"
```
In this case, I have used a deployment from https://github.com/premAI-io/prem-operator/blob/main/examples/big-agi.yaml.
I would be interested in providing more GPUs to a single deployment.
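For reference, in a plain Kubernetes pod spec multiple GPUs are requested by raising the extended-resource count in the container's limits; whether the AIDeployment CRD exposes an equivalent knob is exactly the open question here. The sketch below uses only standard Kubernetes fields, and the name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-example        # illustrative name
spec:
  containers:
    - name: model
      image: example/image:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "2"    # request two whole GPUs on one node
```

The scheduler then only places the pod on a node whose `nvidia.com/gpu` capacity can satisfy the count.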
Also, related to MIGs: in a cluster where GPUs are not labeled as `nvidia.com/gpu`:
```
kubectl get nodes node-with-mig -o yaml | grep 'nvidia.com/mig'
  nvidia.com/mig-7g.40gb.count: "4"
  nvidia.com/mig-7g.40gb.engines.copy: "7"
  nvidia.com/mig-7g.40gb.engines.decoder: "5"
  nvidia.com/mig-7g.40gb.engines.encoder: "0"
  nvidia.com/mig-7g.40gb.engines.jpeg: "1"
  nvidia.com/mig-7g.40gb.engines.ofa: "1"
  nvidia.com/mig-7g.40gb.memory: "40192"
  nvidia.com/mig-7g.40gb.multiprocessors: "98"
  nvidia.com/mig-7g.40gb.product: NVIDIA-A100-SXM4-40GB-MIG-7g.40gb
  nvidia.com/mig-7g.40gb.replicas: "1"
  nvidia.com/mig-7g.40gb.slices.ci: "7"
  nvidia.com/mig-7g.40gb.slices.gi: "7"
  nvidia.com/mig.capable: "true"
  nvidia.com/mig.config: all-7g.40gb
  nvidia.com/mig.config.state: failed
  nvidia.com/mig.strategy: mixed
  nvidia.com/mig-1g.5gb: "0"
  nvidia.com/mig-2g.10gb: "0"
  nvidia.com/mig-3g.20gb: "0"
  nvidia.com/mig-7g.40gb: "4"
```
Will `aideployments.premlabs.io` be able to select a MIG?
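With the `mixed` MIG strategy shown above, the NVIDIA device plugin advertises each MIG profile as its own extended resource, so a plain pod would request a slice like this (standard device-plugin behavior, not something the operator is confirmed to support):

```yaml
resources:
  limits:
    nvidia.com/mig-7g.40gb: "1"  # one 7g.40gb MIG slice instead of nvidia.com/gpu
```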
I skimmed through the codebase briefly and observed that the `nvidia.com/gpu` label is hard-coded:
https://github.com/premAI-io/prem-operator/blob/0322a6b8f9451349d7896030b75247707c1f3131/controllers/constants/labels.go#L4
(Apologies again, I am not able to test it myself at the moment on the MIG cluster).