Description
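After enabling the Volcano vGPU plugin, we want to use queues to control vGPU resources. The cluster has 4 GPU cards, each with 15 GB of memory.
1. Create the following queue: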
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: gpu-work
spec:
  capability:
    volcano.sh/vgpu-number: 3
2. First, submit 3 exclusive tasks:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-5
  namespace: gpu-test
  annotations:
    nvidia.com/use-gputype: "T4"
    scheduling.volcano.sh/queue-name: gpu-work
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: ezone.ksyun.com/ezone/dh_docker/snapshot/ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1 # unit: count
3. All 3 pods reach Running. Then submit 2 shared tasks (setting both vgpu-number and vgpu-memory):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-5
  namespace: gpu-test
  annotations:
    nvidia.com/use-gputype: "T4"
    #scheduling.volcano.sh/queue-name: gpu-work
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: ezone.ksyun.com/ezone/dh_docker/snapshot/ubuntu:22.04
      command: ["bash", "-c", "sleep 86400"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1 # unit: count
          volcano.sh/vgpu-memory: 5000 # unit: Mi
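4. Result: the 5th pod stays Pending. It looks as if the queue does not support shared vGPU mode. How are the vGPU plugin and queues in Volcano meant to work together? I could not find any documentation on this.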
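
To make the expectation behind this report concrete, here is a rough back-of-the-envelope sketch in Python. It is not how Volcano accounts for resources. It only tallies the numbers quoted above in two ways: by card memory, and by summed `volcano.sh/vgpu-number`. The helper names, the first-fit placement, and reading "15g" as 15 * 1024 Mi are all assumptions made for illustration.

```python
# Numbers come from the report above; the accounting model is an assumption.
CARDS = 4
CARD_MEMORY_MI = 15 * 1024  # "15g" per card, assumed to mean 15 * 1024 Mi
QUEUE_VGPU_CAP = 3          # volcano.sh/vgpu-number in the gpu-work queue

# Each pod is (vgpu-number, vgpu-memory in Mi); None means a whole card.
pods = [(1, None)] * 3 + [(1, 5000)] * 2


def fits_by_memory(pods, cards, card_memory):
    """Hypothetical first-fit placement that looks at card memory only."""
    free = [card_memory] * cards
    for _, mem in pods:
        need = card_memory if mem is None else mem
        for i, available in enumerate(free):
            if available >= need:
                free[i] -= need
                break
        else:
            # No card had enough free memory for this pod.
            return False
    return True


def total_vgpu_number(pods):
    """Sum of the vgpu-number requests, ignoring memory entirely."""
    return sum(number for number, _ in pods)


print(fits_by_memory(pods, CARDS, CARD_MEMORY_MI))  # True
print(total_vgpu_number(pods), QUEUE_VGPU_CAP)      # 5 3
```

By memory alone, all five pods fit on the four cards, which is why the Pending pod is surprising. Counted purely by `vgpu-number`, the five requests add up to 5 against a queue capability of 3. Whether the queue really counts shared requests this way is exactly the open question here.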