
NVIDIA DRA Driver for GPUs

Enables

  • flexible and powerful allocation and dynamic reconfiguration of GPUs as well as
  • allocation of ComputeDomains for robust and secure Multi-Node NVLink.

For Kubernetes 1.32 or newer, with Dynamic Resource Allocation (DRA) enabled.
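On a cluster you manage yourself, "DRA enabled" typically means turning on the DynamicResourceAllocation feature gate and the resource.k8s.io API group (the kind-based demo further below uses helper scripts that set up a suitable cluster for you). As a rough, illustrative sketch only, a kind cluster configuration could look along these lines:

# Illustrative sketch of a kind cluster config with DRA enabled
# (Kubernetes 1.32); adjust to your environment and Kubernetes version.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  DynamicResourceAllocation: true
runtimeConfig:
  resource.k8s.io/v1beta1: "true"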

Overview

DRA is a novel concept in Kubernetes for flexibly requesting, configuring, and sharing specialized devices like GPUs. To learn more about DRA in general, good starting points are: Kubernetes docs, GKE docs, Kubernetes blog.

Most importantly, DRA puts resource configuration and scheduling in the hands of 3rd-party vendors.

The NVIDIA DRA Driver for GPUs manages two types of resources: GPUs and ComputeDomains. Correspondingly, it contains two DRA kubelet plugins: gpu-kubelet-plugin and compute-domain-kubelet-plugin. Upon driver installation, each of these two parts can be enabled or disabled separately.
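For illustration, toggling the two parts at install time boils down to a pair of Helm values along the following lines (the key names shown here are illustrative; the chart's values.yaml and the installation docs are authoritative):

# Illustrative Helm values sketch; key names may differ between chart versions.
resources:
  gpus:
    enabled: false           # GPU kubelet plugin (currently disabled by default)
  computeDomains:
    enabled: true            # ComputeDomain kubelet plugin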

The two sections below give a brief overview of each of these two parts.

ComputeDomains

An abstraction for robust and secure Multi-Node NVLink (MNNVL). Officially supported.

An individual ComputeDomain (CD) guarantees MNNVL-reachability between pods that are in the CD, and secure isolation from other pods that are not in the CD.

In terms of placement, a CD follows the workload. In terms of lifetime, a CD is ephemeral: its lifetime is bound to the lifetime of the consuming workload. For more background on how ComputeDomains facilitate orchestrating MNNVL workloads on Kubernetes (and on NVIDIA GB200 systems in particular), see this doc and this slide deck. For an outlook and specific plans for improvements, please refer to these release notes.

If you've heard about IMEX: this DRA driver orchestrates IMEX primitives (daemons, domains, channels) under the hood.
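To give a rough idea of the workflow: a workload requests a CD by creating a ComputeDomain object, which in turn provides a ResourceClaimTemplate for the workload's pods to reference. The following is only a sketch (object name, field names, and API version are illustrative); the specs under demo/ in this repository are the authoritative examples:

# Illustrative ComputeDomain sketch; see demo/ in this repository for
# authoritative, version-matched examples.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: my-compute-domain
spec:
  numNodes: 2                            # number of nodes the workload spans
  channel:
    resourceClaimTemplate:
      name: my-compute-domain-channel    # referenced by the workload's pods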

GPUs

The GPU allocation side of this DRA driver will enable powerful features (such as dynamic allocation of MIG devices). To learn about what we're planning to build, please have a look at these release notes.

While some GPU allocation features can be tried out, they are not yet officially supported. Hence, the GPU kubelet plugin is currently disabled by default in the Helm chart installation.

For exploration and demonstration purposes, see the "demo" section below, and also browse the demo/specs/quickstart directory in this repository.

Installation

As of today, the recommended installation method is via Helm. Detailed instructions can (for now) be found here. In the future, this driver will be included in the NVIDIA GPU Operator and will no longer need to be installed separately.

A (kind) demo

Below, we demonstrate a basic use case: sharing a single GPU across two containers running in the same Kubernetes pod.

Step 1: install dependencies

Running this demo requires a few common tools that the steps below rely on: a container runtime such as Docker (for building the driver image and backing the kind cluster), as well as kind, kubectl, and Helm.

Step 2: create kind cluster with the DRA driver installed

Start by cloning this repository and changing into it:

git clone https://github.com/NVIDIA/k8s-dra-driver-gpu.git
cd k8s-dra-driver-gpu

Next up, build this driver's container image and create a kind-based Kubernetes cluster:

export KIND_CLUSTER_NAME="kind-dra-1"
./demo/clusters/kind/build-dra-driver-gpu.sh
./demo/clusters/kind/create-cluster.sh

Now you can install the DRA driver's Helm chart into the Kubernetes cluster:

./demo/clusters/kind/install-dra-driver-gpu.sh

Step 3: run workload

Submit workload:

kubectl apply -f ./demo/specs/quickstart/gpu-test2.yaml

If you're curious, have a look at the ResourceClaimTemplate definition in this spec, and at how the corresponding single ResourceClaim is referenced by both containers.
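Condensed, the structure looks roughly like the following (a sketch only; names and images here are illustrative, and the gpu-test2.yaml spec in this repository is authoritative): a ResourceClaimTemplate requesting one GPU, and a pod in which both containers point at the same claim entry.

# Condensed, illustrative sketch of the spec's structure.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
  namespace: gpu-test2
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: pod
  namespace: gpu-test2
spec:
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu   # one claim, instantiated from the template
  containers:
  - name: ctr0
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: shared-gpu                    # both containers reference the same claim
  - name: ctr1
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]
    resources:
      claims:
      - name: shared-gpu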

Inspecting the container logs then confirms that both containers operate on the same GPU device:

$ kubectl logs pod -n gpu-test2 --all-containers --prefix
[pod/pod/ctr0] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
[pod/pod/ctr1] GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)

Contributing

Contributions require a Developer Certificate of Origin (DCO; see CONTRIBUTING.md).

Support

Please open an issue on the GitHub project for questions and for reporting problems. Your feedback is appreciated!