
vGPU count on a single GPU exceeds the split count limit #1330

@climby

Description


What happened:

Created pods each requesting 1 vGPU, at roughly 10 concurrent requests per second.
The vGPU distribution across the GPUs is not balanced. Here are the metrics from hami-scheduler:31993/metrics:

```
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-17a8e6e5-16c0-621d-e674-7ffc89498e4d",nodeid="gn-10-245-36-20",zone="vGPU"} 11
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-c780634e-8a90-c7d6-4da2-3f400a364ba8",nodeid="gn-10-245-36-34",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="1",deviceuuid="GPU-191941e0-7b6d-34c4-4e69-72169b3ff7fe",nodeid="gn-10-245-36-20",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="1",deviceuuid="GPU-94d5df55-a5f6-a0d9-3d52-f782ab39373a",nodeid="gn-10-245-36-34",zone="vGPU"} 10
GPUDeviceSharedNum{deviceidx="2",deviceuuid="GPU-34d69f5c-7b8e-87a8-5678-e00c9eecd4a5",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="2",deviceuuid="GPU-feec1ca5-d9cf-4d8b-e87b-a7a3f3155d7f",nodeid="gn-10-245-36-34",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="3",deviceuuid="GPU-0778ac95-7f1f-2150-a1ad-5bda2266f6a5",nodeid="gn-10-245-36-20",zone="vGPU"} 11
GPUDeviceSharedNum{deviceidx="3",deviceuuid="GPU-e1d81879-e979-7d5a-1c8b-c2582826dfd6",nodeid="gn-10-245-36-34",zone="vGPU"} 5
GPUDeviceSharedNum{deviceidx="4",deviceuuid="GPU-8e04e647-05f1-b1bf-c2f7-184b0eb41c11",nodeid="gn-10-245-36-34",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="4",deviceuuid="GPU-d101350f-086d-8154-5028-0facce561fe3",nodeid="gn-10-245-36-20",zone="vGPU"} 5
GPUDeviceSharedNum{deviceidx="5",deviceuuid="GPU-4c23ba1d-e307-42e8-9a35-381c27f2df30",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="5",deviceuuid="GPU-c45a60d6-ac5d-aa22-ec33-a8c23a6c4901",nodeid="gn-10-245-36-34",zone="vGPU"} 7
GPUDeviceSharedNum{deviceidx="6",deviceuuid="GPU-ba775889-6b3c-8637-5117-8df7578bd70d",nodeid="gn-10-245-36-34",zone="vGPU"} 12
GPUDeviceSharedNum{deviceidx="6",deviceuuid="GPU-f97b3ce6-bdfb-a947-7466-435f2e4cb15f",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="7",deviceuuid="GPU-075b5d7f-c0ed-a221-c8d4-4ef830a74a3a",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="7",deviceuuid="GPU-71dc5970-53ea-7bec-5554-00eedc3bd16e",nodeid="gn-10-245-36-34",zone="vGPU"} 4
```

The vGPU split count in the config is 10, but more than 10 vGPUs were assigned to deviceuuid="GPU-17a8e6e5-16c0-621d-e674-7ffc89498e4d", deviceuuid="GPU-0778ac95-7f1f-2150-a1ad-5bda2266f6a5", and deviceuuid="GPU-ba775889-6b3c-8637-5117-8df7578bd70d".

This happened when creating pods with high concurrency.
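
For reference, a minimal sketch (not part of HAMi) of how the over-limit GPUs above can be pulled out of the metrics endpoint; the endpoint address and the limit of 10 are taken from this report, and the simple line parsing is only illustrative:

```go
// checksharednum.go — a sketch, not part of HAMi: scrape the scheduler metrics
// endpoint and print every GPU whose GPUDeviceSharedNum exceeds the configured
// split count. The URL and the limit of 10 come from this report.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

func main() {
	const metricsURL = "http://hami-scheduler:31993/metrics" // assumed service address
	const splitCount = 10.0                                  // deviceSplitCount from the config

	resp, err := http.Get(metricsURL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "GPUDeviceSharedNum{") {
			continue
		}
		// Each sample ends with its value, e.g. `...,zone="vGPU"} 11`.
		fields := strings.Fields(line)
		value, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		if value > splitCount {
			fmt.Println("over split count:", line)
		}
	}
}
```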

What you expected to happen:

The vGPUs should be distributed across the GPUs, and the vGPU count on each GPU should not exceed the split count set in the config file.

How to reproduce it (as minimally and precisely as possible):

  • 3 replicas of the hami-scheduler deployment (one replica frequently goes down)
  • Create pods requesting 1 vGPU each from 10 concurrent processes (see the sketch after this list)
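
A minimal Go sketch of such a burst, assuming client-go, HAMi's default nvidia.com/gpu resource name, a scheduler named hami-scheduler, and a placeholder CUDA image; it only approximates the concurrency described above, not the exact workload:

```go
// burstcreate.go — a sketch, not an official HAMi test: create pods that each
// request 1 vGPU from 10 goroutines at once. The namespace, image, scheduler
// name, and resource name (nvidia.com/gpu) are assumptions for illustration.
package main

import (
	"context"
	"fmt"
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // 10 concurrent pod creations
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			pod := &corev1.Pod{
				ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("vgpu-burst-%d", i)},
				Spec: corev1.PodSpec{
					SchedulerName: "hami-scheduler", // assumed scheduler name
					Containers: []corev1.Container{{
						Name:    "cuda",
						Image:   "nvidia/cuda:12.4.0-base-ubuntu22.04", // placeholder image
						Command: []string{"sleep", "infinity"},         // keep the allocation held
						Resources: corev1.ResourceRequirements{
							Limits: corev1.ResourceList{
								"nvidia.com/gpu": resource.MustParse("1"), // 1 vGPU per pod
							},
						},
					}},
				},
			}
			if _, err := cs.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
				fmt.Println("create failed:", err)
			}
		}(i)
	}
	wg.Wait()
}
```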

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
```
I0908 10:58:34.400604 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.400611 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.400814 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" node="gn-10-245-36-34"
I0908 10:58:34.400828 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" node="gn-10-245-36-20"
E0908 10:58:34.401082 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
E0908 10:58:34.401101 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.401160 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.401178 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401184 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401346 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" node="gn-10-245-36-20"
I0908 10:58:34.401348 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" node="gn-10-245-36-34"
E0908 10:58:34.401614 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
E0908 10:58:34.401634 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401692 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.401713 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.401720 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.401928 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" node="gn-10-245-36-20"
I0908 10:58:34.401950 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" node="gn-10-245-36-34"
E0908 10:58:34.402196 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
E0908 10:58:34.402213 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.402262 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.402281 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402287 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402470 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" node="gn-10-245-36-34"
I0908 10:58:34.402471 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" node="gn-10-245-36-20"
E0908 10:58:34.402728 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
E0908 10:58:34.402745 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402808 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
```
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
    V2.6.1
  • nvidia driver or other AI device driver version:
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:

Labels: kind/bug