
Commit 25f5355

Merge pull request #91003 from StephenJamesSmith/TELCODOCS-2226
TELCODOCS-2226: AI Distributed workloads with RDMA 1
2 parents 978cb43 + 999364a commit 25f5355

11 files changed (+1280, -0 lines)

_topic_maps/_topic_map.yml

Lines changed: 2 additions & 0 deletions
@@ -3616,6 +3616,8 @@ Topics:
   File: amd-gpu-operator
 - Name: Intel Gaudi AI accelerators
   File: gaudi-ai-accelerator
+- Name: Remote Direct Memory Access (RDMA)
+  File: rdma-remote-direct-memory-access
 ---
 Name: Backup and restore
 Dir: backup_and_restore
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
:_mod-docs-content-type: ASSEMBLY
[id="rdma-remote-direct-memory-access"]
= NVIDIA GPUDirect Remote Direct Memory Access (RDMA)
include::_attributes/common-attributes.adoc[]
:context: rdma-remote-direct-memory-access

toc::[]

NVIDIA GPUDirect Remote Direct Memory Access (RDMA) allows one computer to directly access the memory of another computer without going through the operating system. Because RDMA bypasses kernel involvement, it frees up resources and greatly reduces the CPU overhead that is normally needed to process network communications. This is useful for distributing GPU-accelerated workloads across clusters, and because RDMA is well suited to high-bandwidth, low-latency applications, it is ideal for big data and machine learning workloads.

There are currently three configuration methods for NVIDIA GPUDirect RDMA:

Shared device:: This method allows an NVIDIA GPUDirect RDMA device to be shared among multiple pods on the {product-title} worker node where the device is exposed.

Host device:: This method provides direct physical Ethernet access on the worker node by creating an additional host network on a pod. A plugin allows the network device to be moved from the host network namespace to the network namespace on the pod.

SR-IOV legacy device:: The Single Root I/O Virtualization (SR-IOV) method can share a single network device, such as an Ethernet adapter, with multiple pods. SR-IOV segments the device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). The VF is used like any other network device.

Each of these methods can be used over either RDMA over Converged Ethernet (RoCE) or InfiniBand infrastructure, providing a total of six configuration methods.
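
For example, with the shared device method a workload pod requests the RDMA device as an extended resource alongside a GPU. The following is a minimal sketch rather than part of this procedure: the pod name and image are placeholders, and the `rdma/rdma_shared_device_eth` resource name depends on how the RDMA shared device plugin is configured in your cluster.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: rdma-gpu-test                               # placeholder name
spec:
  containers:
  - name: app
    image: quay.io/example/rdma-workload:latest     # placeholder image
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]                           # lets the container pin memory for RDMA
    resources:
      limits:
        nvidia.com/gpu: 1                           # GPU advertised by the GPU Operator
        rdma/rdma_shared_device_eth: 1              # RDMA resource advertised by the shared device plugin
----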

:FeatureName: Remote Direct Memory Access

include::modules/rdma-prerequisites.adoc[leveloffset=+1]

* Install the xref:../hardware_enablement/psap-node-feature-discovery-operator.adoc#installing-the-node-feature-discovery-operator_node-feature-discovery-operator[Node Feature Discovery Operator].

* Install the xref:../networking/networking_operators/sr-iov-operator/installing-sriov-operator.adoc#installing-sriov-operator[SR-IOV Operator].

* Install the link:https://docs.nvidia.com/networking/display/kubernetes2501/getting-started-openshift.html#network-operator-installation-using-openshift-oc-cli[NVIDIA Network Operator] (NVIDIA documentation).

* Install the link:https://docs.nvidia.com/datacenter/cloud-native/openshift/24.9.2/install-gpu-ocp.html[NVIDIA GPU Operator] (NVIDIA documentation).
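
As a quick check that the required Operators are installed, you can list the installed ClusterServiceVersions. This is only a sketch; the exact CSV names and namespaces vary with the Operator versions that you install.

[source,terminal]
----
$ oc get csv -A | grep -Ei 'nfd|sriov-network|nvidia'
----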

include::modules/rdma-disabling-irdma-kernel-module.adoc[leveloffset=+1]

include::modules/rdma-creating-persistent-naming-rules.adoc[leveloffset=+1]

include::modules/rdma-configuring-the-nfd-operator.adoc[leveloffset=+1]

include::modules/rdma-configuring-the-sriov-operator.adoc[leveloffset=+1]

include::modules/rdma-configuring-the-nvidia-network-operator.adoc[leveloffset=+1]

include::modules/rdma-configuring-the-gpu-operator.adoc[leveloffset=+1]
Lines changed: 323 additions & 0 deletions
@@ -0,0 +1,323 @@
// Module included in the following assemblies:
//
// * hardware_accelerators/rdma-remote-direct-memory-access.adoc

:_mod-docs-content-type: PROCEDURE
[id="rdma-configuring-the-gpu-operator_{context}"]
= Configuring the GPU Operator

The GPU Operator automates the management of the NVIDIA drivers, device plugins for GPUs, the NVIDIA Container Toolkit, and other components required for GPU provisioning.

.Prerequisites

* You have installed the GPU Operator.

.Procedure

. Check that the Operator pod is running by listing the pods in the namespace with the following command:
+
[source,terminal]
----
$ oc get pods -n nvidia-gpu-operator
----
+
.Example output
[source,terminal]
----
NAME                          READY   STATUS    RESTARTS   AGE
gpu-operator-b4cb7d74-zxpwq   1/1     Running   0          32s
----

. Create a GPU cluster policy custom resource file, for example `gpu-cluster-policy.yaml`, similar to the following example:
+
[source,yaml]
----
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: false
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    version: 2.20.5
    repository: nvcr.io/nvidia/cloud-native
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
----

. After you have generated the GPU `ClusterPolicy` custom resource file, create the resource on the cluster by running the following command:
+
[source,terminal]
----
$ oc create -f gpu-cluster-policy.yaml
----
+
.Example output
[source,terminal]
----
clusterpolicy.nvidia.com/gpu-cluster-policy created
----
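+
You can also watch the `ClusterPolicy` status while the GPU Operator deploys its components. This is a sketch that assumes the `ClusterPolicy` reports its overall state in the `status.state` field, which typically becomes `ready` after all components are running:
+
[source,terminal]
----
$ oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
----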

. Validate that the Operator is installed and running by listing the pods with the following command:
+
[source,terminal]
----
$ oc get pods -n nvidia-gpu-operator
----
+
.Example output
[source,terminal]
----
NAME                                                   READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-d5ngn                            1/1     Running     0          3m20s
gpu-feature-discovery-z42rx                            1/1     Running     0          3m23s
gpu-operator-6bb4d4b4c5-njh78                          1/1     Running     0          4m35s
nvidia-container-toolkit-daemonset-bkh8l               1/1     Running     0          3m20s
nvidia-container-toolkit-daemonset-c4hzm               1/1     Running     0          3m23s
nvidia-cuda-validator-4blvg                            0/1     Completed   0          106s
nvidia-cuda-validator-tw8sl                            0/1     Completed   0          112s
nvidia-dcgm-exporter-rrw4g                             1/1     Running     0          3m20s
nvidia-dcgm-exporter-xc78t                             1/1     Running     0          3m23s
nvidia-dcgm-nvxpf                                      1/1     Running     0          3m20s
nvidia-dcgm-snj4j                                      1/1     Running     0          3m23s
nvidia-device-plugin-daemonset-fk2xz                   1/1     Running     0          3m23s
nvidia-device-plugin-daemonset-wq87j                   1/1     Running     0          3m20s
nvidia-driver-daemonset-416.94.202410211619-0-ngrjg    4/4     Running     0          3m58s
nvidia-driver-daemonset-416.94.202410211619-0-tm4x6    4/4     Running     0          3m58s
nvidia-node-status-exporter-jlzxh                      1/1     Running     0          3m57s
nvidia-node-status-exporter-zjffs                      1/1     Running     0          3m57s
nvidia-operator-validator-l49hx                        1/1     Running     0          3m20s
nvidia-operator-validator-n44nn                        1/1     Running     0          3m23s
----

. Optional: When you have verified that the pods are running, open a remote shell into the NVIDIA driver daemonset pod and confirm that the NVIDIA modules are loaded. Specifically, ensure that the `nvidia_peermem` module is loaded.
+
[source,terminal]
----
$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
sh-4.4# lsmod|grep nvidia
----
+
.Example output
[source,terminal]
----
nvidia_fs             327680  0
nvidia_peermem         24576  0
nvidia_modeset       1507328  0
video                  73728  1 nvidia_modeset
nvidia_uvm           6889472  8
nvidia               8810496  43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs             217088  3 nvidia_peermem,rdma_ucm,mlx5_ib
drm                   741376  5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
----

. Optional: Run the `nvidia-smi` utility to show details about the driver and the hardware:
+
[source,terminal]
----
sh-4.4# nvidia-smi
----
+
.Example output
[source,terminal]
----
Wed Nov  6 22:03:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     On  |   00000000:61:00.0 Off |                    0 |
|  0%   37C    P0             88W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     On  |   00000000:E1:00.0 Off |                    0 |
|  0%   28C    P8             29W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
----

. Set the GPU clock to maximum from within the driver daemonset pod by using the `nvidia-smi` command:
+
[source,terminal]
----
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
----
+
.Example output
[source,terminal]
----
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
All done.
----
+
[source,terminal]
----
sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
----
+
.Example output
[source,terminal]
----
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0
All done.
----
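+
Optionally, confirm the applied clock settings while you are still in the driver pod. This is a sketch; the exact fields in the report depend on the driver version:
+
[source,terminal]
----
sh-4.4# nvidia-smi -q -d CLOCK
----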

. Verify that the GPU and RDMA resources are advertised on the worker nodes by running the following command:
+
[source,terminal]
----
$ oc describe node -l node-role.kubernetes.io/worker= | grep -E 'Capacity:|Allocatable:' -A9
----
+
.Example output
[source,terminal]
----
Capacity:
  cpu:                          128
  ephemeral-storage:            1561525616Ki
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       263596712Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
Allocatable:
  cpu:                          127500m
  ephemeral-storage:            1438028263499
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       262445736Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
--
Capacity:
  cpu:                          128
  ephemeral-storage:            1561525616Ki
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       263596672Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
Allocatable:
  cpu:                          127500m
  ephemeral-storage:            1438028263499
  hugepages-1Gi:                0
  hugepages-2Mi:                0
  memory:                       262445696Ki
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
----
