Description
NVMe devices have a fixed number of interrupt vectors (IVs). The `nvme_driver` creates one IoIssuer and IO queue pair (IOQP) per interrupt vector. When the number of IVs is less than the number of vCPUs, some vCPUs must share the same IOQP and IV. The `nvme_driver` creates them in a greedy fashion, based on the CPU on which the guest issued the IO. Under certain guest workloads, this means that the IVs of every NVMe device in OpenHCL can pile up on a relatively small subset of CPUs. The problem is amplified when NVMe devices back a striped disk: a single IO (say, a write on CPU 0) can cause multiple NVMe devices to create an IO issuer on CPU 0.
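To make the greedy behavior concrete, here is a minimal sketch, not the actual `nvme_driver` code and with all names hypothetical, of an issuer being created for whichever CPU happens to issue IO first, until the device's IVs are exhausted:

```rust
use std::collections::HashMap;

struct IoIssuer {
    cpu: u32,
    iv: u32,
}

struct PerDeviceIssuers {
    max_ivs: u32,
    issuers: HashMap<u32, IoIssuer>, // keyed by issuing CPU
}

impl PerDeviceIssuers {
    /// Greedy: create a new issuer (consuming an IV) for any CPU that issues
    /// IO while IVs remain; afterwards, later CPUs share an already-created
    /// issuer (the sharing policy is elided here).
    fn issuer_for_cpu(&mut self, cpu: u32) -> &IoIssuer {
        if !self.issuers.contains_key(&cpu) && (self.issuers.len() as u32) < self.max_ivs {
            let iv = self.issuers.len() as u32;
            self.issuers.insert(cpu, IoIssuer { cpu, iv });
        }
        self.issuers
            .get(&cpu)
            .or_else(|| self.issuers.values().next())
            .expect("at least one issuer must exist")
    }
}

fn main() {
    let mut dev = PerDeviceIssuers { max_ivs: 2, issuers: HashMap::new() };
    // IO arrives on CPUs 0, 1, 2: CPUs 0 and 1 each get an issuer/IV; CPU 2
    // must share because this device has only 2 IVs.
    for cpu in [0, 1, 2] {
        let issuer = dev.issuer_for_cpu(cpu);
        println!("IO on CPU {cpu} -> issuer on CPU {} (IV {})", issuer.cpu, issuer.iv);
    }
}
```

With several devices behind one striped disk, each device runs this same greedy logic independently, so they all tend to land their issuers on the same few CPUs that happen to issue the IO.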
We should change the algorithm to not overload a subset of CPUs. Some options:
- Option 1: The vtl2 settings worker knows how many NVMe devices an OpenHCL VM will have. When those settings are supplied, create a global cap and respect it in the `nvme_driver`.
- Option 2: The `nvme_driver` can keep a tally of the number of IVs assigned to each CPU. When the gap between the max and min counts becomes too great, pick a "close" CPU instead.
- Option 3: The supported number of IVs and vCPUs is known, so "spread out" the IVs across CPUs by generating a stride. Be careful not to start this stride on the same CPU for all NVMe devices (see the sketch after this list).
In general, any fallback to a close CPU should try to preserve NUMA locality.
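For Option 3, a minimal sketch of a stride-based spread, assuming the driver knows the vCPU count, its IV count, and a stable per-device index (all names below are hypothetical, and a real implementation would also need to respect the NUMA note above):

```rust
/// Hypothetical helper for Option 3: choose the CPU for each of a device's
/// interrupt vectors by striding across the vCPU space, with a per-device
/// starting offset so that multiple NVMe devices do not all begin on CPU 0.
fn spread_ivs_across_cpus(num_vcpus: u32, num_ivs: u32, device_index: u32) -> Vec<u32> {
    assert!(num_vcpus > 0 && num_ivs > 0);
    // Stride so the IVs land roughly evenly across the vCPU range.
    let stride = (num_vcpus / num_ivs).max(1);
    // Start each device at a different offset within one stride window.
    let start = device_index % stride;
    (0..num_ivs)
        .map(|iv| (start + iv * stride) % num_vcpus)
        .collect()
}

fn main() {
    // Example: 16 vCPUs, 4 IVs per device, three devices striping one disk.
    for dev in 0..3 {
        println!("device {dev}: IV -> CPU {:?}", spread_ivs_across_cpus(16, 4, dev));
    }
}
```

Offsetting the starting CPU by the device index keeps the devices backing a striped disk from all beginning their stride on the same CPU, which is the overlap described above.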
Reported-By: @fliang-ms