Support For Multiple GPUs And NVIDIA MIG(s)? #39

@vkhitrin

Apologies if I have missed this topic and it is already covered in the documentation/CRDs.

When using an NVIDIA accelerator in aideployments.premlabs.io, I was wondering if it is possible to provide multiple GPUs.
In a cluster with multiple GPUs:

kubectl get node lab-infra-node-3 -o yaml | grep 'nvidia.com/gpu:'
    nvidia.com/gpu: "3"

Only a single GPU was requested by the deployment:

kubectl -n premai get pods hermes-758f784656-79fbv -o yaml | grep 'nvidia.com/gpu'
        nvidia.com/gpu: "1"

In this case, I have used a deployment from https://github.com/premAI-io/prem-operator/blob/main/examples/big-agi.yaml.
I would be interested in providing more GPUs to a single deployment.
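
For comparison, on a plain Kubernetes Pod (outside of the operator) several GPUs are requested through the container's resource limits. Below is a minimal sketch, assuming the NVIDIA device plugin is running on the node. The pod name and image are placeholders, and I do not know which AIDeployment field, if any, maps to this value:

```yaml
# Plain Pod manifest, NOT an AIDeployment; shown only to illustrate the request.
apiVersion: v1
kind: Pod
metadata:
  name: multi-gpu-example                 # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: app
      image: registry.example.com/cuda-app:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 3               # all three GPUs advertised by lab-infra-node-3
```

Ideally the AIDeployment spec would let me set that count instead of always requesting "1".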

Also, related to MIGs, in a cluster where GPUs are not labeled as nvidia.com/gpu:

kubectl get nodes node-with-mig -o yaml | grep 'nvidia.com/mig'
    nvidia.com/mig-7g.40gb.count: "4"
    nvidia.com/mig-7g.40gb.engines.copy: "7"
    nvidia.com/mig-7g.40gb.engines.decoder: "5"
    nvidia.com/mig-7g.40gb.engines.encoder: "0"
    nvidia.com/mig-7g.40gb.engines.jpeg: "1"
    nvidia.com/mig-7g.40gb.engines.ofa: "1"
    nvidia.com/mig-7g.40gb.memory: "40192"
    nvidia.com/mig-7g.40gb.multiprocessors: "98"
    nvidia.com/mig-7g.40gb.product: NVIDIA-A100-SXM4-40GB-MIG-7g.40gb
    nvidia.com/mig-7g.40gb.replicas: "1"
    nvidia.com/mig-7g.40gb.slices.ci: "7"
    nvidia.com/mig-7g.40gb.slices.gi: "7"
    nvidia.com/mig.capable: "true"
    nvidia.com/mig.config: all-7g.40gb
    nvidia.com/mig.config.state: failed
    nvidia.com/mig.strategy: mixed
    nvidia.com/mig-1g.5gb: "0"
    nvidia.com/mig-2g.10gb: "0"
    nvidia.com/mig-3g.20gb: "0"
    nvidia.com/mig-7g.40gb: "4"
    nvidia.com/mig-1g.5gb: "0"
    nvidia.com/mig-2g.10gb: "0"
    nvidia.com/mig-3g.20gb: "0"
    nvidia.com/mig-7g.40gb: "4"

Will aideployments.premlabs.io be able to select a MIG?
I skimmed through the codebase briefly and noticed that the nvidia.com/gpu label is hard-coded:
https://github.com/premAI-io/prem-operator/blob/0322a6b8f9451349d7896030b75247707c1f3131/controllers/constants/labels.go#L4
(Apologies again, I am not able to test it myself at the moment on the MIG cluster).
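
To illustrate what selecting a MIG would mean: with the "mixed" MIG strategy shown above, each MIG profile is exposed as its own resource name, so a plain Pod would request it as sketched below. This is again a generic Kubernetes manifest with a placeholder name and image, not something I have verified against the operator:

```yaml
# Plain Pod manifest, NOT an AIDeployment; untested on the MIG cluster.
apiVersion: v1
kind: Pod
metadata:
  name: mig-example                       # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: app
      image: registry.example.com/cuda-app:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-7g.40gb: 1       # one of the four 7g.40gb instances on node-with-mig
```

Since the resource name here is nvidia.com/mig-7g.40gb rather than nvidia.com/gpu, a hard-coded nvidia.com/gpu request would presumably never match this node.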
