What happened:
Created pods that each request 1 vGPU, at about 10 concurrent requests per second.
The vGPU distribution across the GPUs is not balanced. Here are the metrics from hami-scheduler:31993/metrics:
```
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-17a8e6e5-16c0-621d-e674-7ffc89498e4d",nodeid="gn-10-245-36-20",zone="vGPU"} 11
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-c780634e-8a90-c7d6-4da2-3f400a364ba8",nodeid="gn-10-245-36-34",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="1",deviceuuid="GPU-191941e0-7b6d-34c4-4e69-72169b3ff7fe",nodeid="gn-10-245-36-20",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="1",deviceuuid="GPU-94d5df55-a5f6-a0d9-3d52-f782ab39373a",nodeid="gn-10-245-36-34",zone="vGPU"} 10
GPUDeviceSharedNum{deviceidx="2",deviceuuid="GPU-34d69f5c-7b8e-87a8-5678-e00c9eecd4a5",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="2",deviceuuid="GPU-feec1ca5-d9cf-4d8b-e87b-a7a3f3155d7f",nodeid="gn-10-245-36-34",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="3",deviceuuid="GPU-0778ac95-7f1f-2150-a1ad-5bda2266f6a5",nodeid="gn-10-245-36-20",zone="vGPU"} 11
GPUDeviceSharedNum{deviceidx="3",deviceuuid="GPU-e1d81879-e979-7d5a-1c8b-c2582826dfd6",nodeid="gn-10-245-36-34",zone="vGPU"} 5
GPUDeviceSharedNum{deviceidx="4",deviceuuid="GPU-8e04e647-05f1-b1bf-c2f7-184b0eb41c11",nodeid="gn-10-245-36-34",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="4",deviceuuid="GPU-d101350f-086d-8154-5028-0facce561fe3",nodeid="gn-10-245-36-20",zone="vGPU"} 5
GPUDeviceSharedNum{deviceidx="5",deviceuuid="GPU-4c23ba1d-e307-42e8-9a35-381c27f2df30",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="5",deviceuuid="GPU-c45a60d6-ac5d-aa22-ec33-a8c23a6c4901",nodeid="gn-10-245-36-34",zone="vGPU"} 7
GPUDeviceSharedNum{deviceidx="6",deviceuuid="GPU-ba775889-6b3c-8637-5117-8df7578bd70d",nodeid="gn-10-245-36-34",zone="vGPU"} 12
GPUDeviceSharedNum{deviceidx="6",deviceuuid="GPU-f97b3ce6-bdfb-a947-7466-435f2e4cb15f",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="7",deviceuuid="GPU-075b5d7f-c0ed-a221-c8d4-4ef830a74a3a",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="7",deviceuuid="GPU-71dc5970-53ea-7bec-5554-00eedc3bd16e",nodeid="gn-10-245-36-34",zone="vGPU"} 4
```
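For convenience, a minimal sketch (Python; the metrics URL and the split-count threshold of 10 are assumptions about this setup, not part of the report) that scrapes the endpoint above and flags any GPU whose GPUDeviceSharedNum exceeds the configured split count:

```python
import re
import urllib.request

# Assumed values for this setup; adjust to your environment.
METRICS_URL = "http://hami-scheduler:31993/metrics"
SPLIT_COUNT = 10  # per-GPU vGPU split count from the HAMi config

body = urllib.request.urlopen(METRICS_URL).read().decode()

# Matches lines like:
# GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-...",nodeid="...",zone="vGPU"} 11
pattern = re.compile(
    r'^GPUDeviceSharedNum\{.*?deviceuuid="([^"]+)".*?nodeid="([^"]+)".*?\}\s+(\d+)',
    re.MULTILINE,
)

for uuid, node, value in pattern.findall(body):
    count = int(value)
    flag = "  <-- exceeds split count" if count > SPLIT_COUNT else ""
    print(f"{node} {uuid}: {count}{flag}")
```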
The vGPU split count in the config is 10, but we can see that more than 10 vGPUs were assigned to deviceuuid="GPU-17a8e6e5-16c0-621d-e674-7ffc89498e4d", deviceuuid="GPU-0778ac95-7f1f-2150-a1ad-5bda2266f6a5", and deviceuuid="GPU-ba775889-6b3c-8637-5117-8df7578bd70d".
This happened when creating pods with high concurrency.
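My guess, not verified against the HAMi source, is a check-then-allocate race: each scheduler instance reads the GPU's current shared number and records the allocation later, and nothing makes the two steps atomic, so concurrent filter calls can all see the same GPU as still below the limit. A minimal sketch of that failure mode, using threads as a stand-in for concurrent scheduler instances:

```python
import threading
import time

SPLIT_COUNT = 10   # configured vGPU split count per physical GPU
shared_num = 0     # vGPUs currently assigned to one physical GPU

def schedule_one():
    """Simulate one scheduler instance placing a 1-vGPU pod."""
    global shared_num
    if shared_num < SPLIT_COUNT:      # check: GPU still has room
        time.sleep(0.01)              # gap between filter and bind
        shared_num += 1               # allocate without re-checking

threads = [threading.Thread(target=schedule_one) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"GPUDeviceSharedNum = {shared_num}, limit = {SPLIT_COUNT}")
# Typically prints a value above the limit; making the check and the
# allocation a single atomic step (or re-validating at bind time)
# would keep it at 10.
```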
What you expected to happen:
The vGPUs should be distributed evenly across the GPUs, and the vGPU count on each GPU should not exceed the split count set in the config file.
How to reproduce it (as minimally and precisely as possible):
- 3 replicas of the hami-scheduler deployment (with only 1 replica, the scheduler would often go down)
- create pods requesting 1 vGPU each with 10 concurrent processes (a sketch follows below)
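For reference, a reproduction sketch (Python with the official kubernetes client; the namespace, pod names, container image, and the nvidia.com/gpu resource name are assumptions, the last being HAMi's default vGPU resource name) that fires 1-vGPU pod creations with 10 concurrent workers:

```python
from concurrent.futures import ThreadPoolExecutor

from kubernetes import client, config

NAMESPACE = "default"            # hypothetical namespace
POD_COUNT = 50                   # total pods to create
CONCURRENCY = 10                 # concurrent create requests
GPU_RESOURCE = "nvidia.com/gpu"  # HAMi's default vGPU resource name (assumption)

config.load_kube_config()
v1 = client.CoreV1Api()

def make_pod(i: int) -> client.V1Pod:
    """Build a pod that requests exactly 1 vGPU."""
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"vgpu-test-{i}"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda",
                    image="nvidia/cuda:12.2.0-base-ubuntu22.04",  # any GPU-capable image
                    command=["sleep", "3600"],
                    resources=client.V1ResourceRequirements(
                        limits={GPU_RESOURCE: "1"},
                    ),
                )
            ],
        ),
    )

def create(i: int) -> None:
    v1.create_namespaced_pod(namespace=NAMESPACE, body=make_pod(i))

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(create, range(POD_COUNT)))
```

Once the pods are running, the GPUDeviceSharedNum metric shown above tells you how many landed on each GPU.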
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
- Your docker or containerd configuration file (e.g: `/etc/docker/daemon.json`)
- The hami-device-plugin container logs
- The hami-scheduler container logs:
```
I0908 10:58:34.400604 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.400611 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.400814 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" node="gn-10-245-36-34"
I0908 10:58:34.400828 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" node="gn-10-245-36-20"
E0908 10:58:34.401082 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
E0908 10:58:34.401101 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.401160 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.401178 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401184 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401346 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" node="gn-10-245-36-20"
I0908 10:58:34.401348 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" node="gn-10-245-36-34"
E0908 10:58:34.401614 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
E0908 10:58:34.401634 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401692 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.401713 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.401720 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.401928 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" node="gn-10-245-36-20"
I0908 10:58:34.401950 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" node="gn-10-245-36-34"
E0908 10:58:34.402196 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
E0908 10:58:34.402213 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.402262 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.402281 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402287 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402470 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" node="gn-10-245-36-34"
I0908 10:58:34.402471 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" node="gn-10-245-36-20"
E0908 10:58:34.402728 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
E0908 10:58:34.402745 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402808 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
```
- The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`
Environment:
- HAMi version: v2.6.1
- nvidia driver or other AI device driver version:
- Docker version from `docker version`
- Docker command, image and tag used
- Kernel version from `uname -a`
- Others: