What happened:
Created pods that each request 1 vGPU, at about 10 concurrent requests per second.
The vGPU distribution across the GPUs is not balanced. Here are the metrics from hami-scheduler:31993/metrics:
```
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-17a8e6e5-16c0-621d-e674-7ffc89498e4d",nodeid="gn-10-245-36-20",zone="vGPU"} 11
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-c780634e-8a90-c7d6-4da2-3f400a364ba8",nodeid="gn-10-245-36-34",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="1",deviceuuid="GPU-191941e0-7b6d-34c4-4e69-72169b3ff7fe",nodeid="gn-10-245-36-20",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="1",deviceuuid="GPU-94d5df55-a5f6-a0d9-3d52-f782ab39373a",nodeid="gn-10-245-36-34",zone="vGPU"} 10
GPUDeviceSharedNum{deviceidx="2",deviceuuid="GPU-34d69f5c-7b8e-87a8-5678-e00c9eecd4a5",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="2",deviceuuid="GPU-feec1ca5-d9cf-4d8b-e87b-a7a3f3155d7f",nodeid="gn-10-245-36-34",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="3",deviceuuid="GPU-0778ac95-7f1f-2150-a1ad-5bda2266f6a5",nodeid="gn-10-245-36-20",zone="vGPU"} 11
GPUDeviceSharedNum{deviceidx="3",deviceuuid="GPU-e1d81879-e979-7d5a-1c8b-c2582826dfd6",nodeid="gn-10-245-36-34",zone="vGPU"} 5
GPUDeviceSharedNum{deviceidx="4",deviceuuid="GPU-8e04e647-05f1-b1bf-c2f7-184b0eb41c11",nodeid="gn-10-245-36-34",zone="vGPU"} 9
GPUDeviceSharedNum{deviceidx="4",deviceuuid="GPU-d101350f-086d-8154-5028-0facce561fe3",nodeid="gn-10-245-36-20",zone="vGPU"} 5
GPUDeviceSharedNum{deviceidx="5",deviceuuid="GPU-4c23ba1d-e307-42e8-9a35-381c27f2df30",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="5",deviceuuid="GPU-c45a60d6-ac5d-aa22-ec33-a8c23a6c4901",nodeid="gn-10-245-36-34",zone="vGPU"} 7
GPUDeviceSharedNum{deviceidx="6",deviceuuid="GPU-ba775889-6b3c-8637-5117-8df7578bd70d",nodeid="gn-10-245-36-34",zone="vGPU"} 12
GPUDeviceSharedNum{deviceidx="6",deviceuuid="GPU-f97b3ce6-bdfb-a947-7466-435f2e4cb15f",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="7",deviceuuid="GPU-075b5d7f-c0ed-a221-c8d4-4ef830a74a3a",nodeid="gn-10-245-36-20",zone="vGPU"} 8
GPUDeviceSharedNum{deviceidx="7",deviceuuid="GPU-71dc5970-53ea-7bec-5554-00eedc3bd16e",nodeid="gn-10-245-36-34",zone="vGPU"} 4
```
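For convenience, a minimal sketch (Python; the metrics URL and the split-count threshold of 10 are assumptions about this setup, not part of the report) that scrapes the endpoint above and flags any GPU whose GPUDeviceSharedNum exceeds the configured split count:

```python
import re
import urllib.request

# Assumed values for this setup; adjust to your environment.
METRICS_URL = "http://hami-scheduler:31993/metrics"
SPLIT_COUNT = 10  # per-GPU vGPU split count from the HAMi config

body = urllib.request.urlopen(METRICS_URL).read().decode()

# Matches lines like:
# GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-...",nodeid="...",zone="vGPU"} 11
pattern = re.compile(
    r'^GPUDeviceSharedNum\{.*?deviceuuid="([^"]+)".*?nodeid="([^"]+)".*?\}\s+(\d+)',
    re.MULTILINE,
)

for uuid, node, value in pattern.findall(body):
    count = int(value)
    flag = "  <-- exceeds split count" if count > SPLIT_COUNT else ""
    print(f"{node} {uuid}: {count}{flag}")
```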
The vGPU split count in the config is 10, but we can see that more than 10 vGPUs were assigned to deviceuuid="GPU-17a8e6e5-16c0-621d-e674-7ffc89498e4d", deviceuuid="GPU-0778ac95-7f1f-2150-a1ad-5bda2266f6a5", and deviceuuid="GPU-ba775889-6b3c-8637-5117-8df7578bd70d".
This happened when creating pods with high concurrency.
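My guess, not verified against the HAMi source, is a check-then-allocate race: each scheduler instance reads the GPU's current shared number and records the allocation later, and nothing makes the two steps atomic, so concurrent filter calls can all see the same GPU as still below the limit. A minimal sketch of that failure mode, using threads as a stand-in for concurrent scheduler instances:

```python
import threading
import time

SPLIT_COUNT = 10   # configured vGPU split count per physical GPU
shared_num = 0     # vGPUs currently assigned to one physical GPU

def schedule_one():
    """Simulate one scheduler instance placing a 1-vGPU pod."""
    global shared_num
    if shared_num < SPLIT_COUNT:      # check: GPU still has room
        time.sleep(0.01)              # gap between filter and bind
        shared_num += 1               # allocate without re-checking

threads = [threading.Thread(target=schedule_one) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"GPUDeviceSharedNum = {shared_num}, limit = {SPLIT_COUNT}")
# Typically prints a value above the limit; making the check and the
# allocation a single atomic step (or re-validating at bind time)
# would keep it at 10.
```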
What you expected to happen:
The vGPUs should be distributed evenly across the GPUs, and the vGPU count on each GPU should not exceed the split count set in the config file.
How to reproduce it (as minimally and precisely as possible):
- 3 replicas of the hami-scheduler deployment (with only 1 replica, the scheduler would often go down)
- create pods requesting 1 vGPU each with 10 concurrent processes (a sketch follows below)
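For reference, a reproduction sketch (Python with the official kubernetes client; the namespace, pod names, container image, and the nvidia.com/gpu resource name are assumptions, the last being HAMi's default vGPU resource name) that fires 1-vGPU pod creations with 10 concurrent workers:

```python
from concurrent.futures import ThreadPoolExecutor

from kubernetes import client, config

NAMESPACE = "default"            # hypothetical namespace
POD_COUNT = 50                   # total pods to create
CONCURRENCY = 10                 # concurrent create requests
GPU_RESOURCE = "nvidia.com/gpu"  # HAMi's default vGPU resource name (assumption)

config.load_kube_config()
v1 = client.CoreV1Api()

def make_pod(i: int) -> client.V1Pod:
    """Build a pod that requests exactly 1 vGPU."""
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"vgpu-test-{i}"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda",
                    image="nvidia/cuda:12.2.0-base-ubuntu22.04",  # any GPU-capable image
                    command=["sleep", "3600"],
                    resources=client.V1ResourceRequirements(
                        limits={GPU_RESOURCE: "1"},
                    ),
                )
            ],
        ),
    )

def create(i: int) -> None:
    v1.create_namespaced_pod(namespace=NAMESPACE, body=make_pod(i))

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(create, range(POD_COUNT)))
```

Once the pods are running, the GPUDeviceSharedNum metric shown above tells you how many landed on each GPU.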
Anything else we need to know?:
- The output of `nvidia-smi -a` on your host
- Your docker or containerd configuration file (e.g: `/etc/docker/daemon.json`)
- The hami-device-plugin container logs
- The hami-scheduler container logs:
```
I0908 10:58:34.400604 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.400611 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.400814 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" node="gn-10-245-36-34"
I0908 10:58:34.400828 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" node="gn-10-245-36-20"
E0908 10:58:34.401082 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
E0908 10:58:34.401101 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69"
I0908 10:58:34.401160 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bgv-944968d8f-5bl69" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.401178 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401184 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401346 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" node="gn-10-245-36-20"
I0908 10:58:34.401348 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" node="gn-10-245-36-34"
E0908 10:58:34.401614 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
E0908 10:58:34.401634 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn"
I0908 10:58:34.401692 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000bf0-b96b766bf-t6qrn" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.401713 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.401720 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.401928 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" node="gn-10-245-36-20"
I0908 10:58:34.401950 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" node="gn-10-245-36-34"
E0908 10:58:34.402196 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
E0908 10:58:34.402213 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv"
I0908 10:58:34.402262 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000acw-d9fbfb958-vlljv" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
I0908 10:58:34.402281 1 schedule_one.go:84] "About to try and schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402287 1 schedule_one.go:97] "Attempting to schedule pod" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402470 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" node="gn-10-245-36-34"
I0908 10:58:34.402471 1 binder.go:895] "All bound volumes for pod match with node" logger="FilterWithNominatedPods.Filter.VolumeBinding" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" node="gn-10-245-36-20"
E0908 10:58:34.402728 1 schedule_one.go:161] "Error selecting node for pod" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
E0908 10:58:34.402745 1 schedule_one.go:1046] "Error scheduling pod; retrying" err="Post \"https://127.0.0.1:443/filter\": dial tcp 127.0.0.1:443: connect: connection refused" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr"
I0908 10:58:34.402808 1 schedule_one.go:1111] "Updating pod condition" pod="pod-prod/p-a5759d6cd5c1-ackcs-00000ab9-6f8f6b6977-zvbjr" conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"
```
- The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
- Any relevant kernel output lines from `dmesg`
Environment:
- HAMi version: v2.6.1
- nvidia driver or other AI device driver version:
- Docker version from `docker version`
- Docker command, image and tag used
- Kernel version from `uname -a`
- Others: