Background:
PE has developed an internal tool that can easily spawn $(numGPUs) VMs, each with a GPU passed through to it. Each VM is added as a new machine to the same MAAS instance as its host. This is useful when we'd like to parallelize the execution of separate jobs that each require only a single GPU, such as a GPU driver load test on various Ubuntu version/kernel combinations, or a large batch of non-intensive GPU-requiring smoke tests. Previously, PE selected an 8-GPU host to be a permanent VM host, but this ephemeral method is more versatile since it allows us to adjust how many VMs or directly-provisionable bare-metal hosts are available day-to-day, depending on project need.
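For reference, the "splitting" step essentially amounts to defining one network-booting VM per GPU with the GPU passed through, then registering each VM with the host's MAAS instance. The sketch below is only an illustration of that idea, not the actual PE scripts; the bridge name, MAAS profile, VM sizing, and qemu+ssh power address are all placeholder assumptions.

```python
#!/usr/bin/env python3
"""Illustrative sketch only -- the real "VM splitter" scripts are PE-internal.

Assumes libvirt/virt-install on the host, GPUs visible to nvidia-smi, a bridge
named br0 on the MAAS PXE network, and a logged-in `maas` CLI profile named
"admin". All of these names and sizes are placeholders.
"""
import subprocess

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def gpu_pci_addresses():
    # One PCI bus id per GPU; trim nvidia-smi's extended domain prefix
    # ("00000000:3B:00.0" -> "0000:3b:00.0") for virt-install.
    out = run("nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader")
    return [line.strip().lower()[-12:] for line in out.splitlines() if line.strip()]

def spawn_vm(index, pci_addr):
    # Define a PXE-booting VM with a single GPU passed through.
    name = f"gpu-vm-{index}"
    run("virt-install", "--name", name,
        "--memory", "16384", "--vcpus", "8", "--disk", "size=100",
        "--network", "bridge=br0",
        "--hostdev", pci_addr,               # GPU passthrough
        "--pxe", "--boot", "network,hd",
        "--os-variant", "ubuntu22.04",
        "--noautoconsole")
    return name

def vm_mac(name):
    # MAC of the VM's first interface, from `virsh domiflist <domain>`.
    for line in run("virsh", "domiflist", name).splitlines():
        fields = line.split()
        if fields and fields[-1].count(":") == 5:
            return fields[-1]
    raise RuntimeError(f"no interface found for {name}")

def enroll_vm(name, host_fqdn):
    # Register the VM with the same MAAS instance as its host so it can be
    # commissioned and deployed like any other machine.
    run("maas", "admin", "machines", "create",
        f"hostname={name}", "architecture=amd64",
        f"mac_addresses={vm_mac(name)}",
        "power_type=virsh",
        f"power_parameters_power_address=qemu+ssh://ubuntu@{host_fqdn}/system",
        f"power_parameters_power_id={name}")

if __name__ == "__main__":
    for i, pci in enumerate(gpu_pci_addresses()):
        enroll_vm(spawn_vm(i, pci), host_fqdn="vm-host.example.com")
```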
I think this type of "dynamic queue" could be broadly valuable, but today it isn't easily streamlined via Testflinger: we have to run the "VM splitter" scripts after reserving the host via TF, and those scripts create new, non-Testflinger-managed MAAS machine instances for each VM, which then require additional automation to integrate with our infrastructure.
Proposal: "Magic queues"
User story: Consider a "magic queue" named "one-gpu-vm". When a user submits a job to this queue (a rough sketch of this dispatch flow follows the list below):
If no agents are attached to the "one-gpu-vm" queue and a matching bare-metal host is in the "waiting" state, take down the bare-metal host's agent, provision the host in MAAS, then run the "VM splitter" scripts on it. This will enroll $(numGPUs) VMs with GPU passthrough in the same MAAS instance as the host.
After the VMs are commissioned in MAAS, TF adds each VM as an agent to the "one-gpu-vm" queue, allowing the queued job(s) to begin.
If there are agents attached to the "one-gpu-vm" queue in the "waiting" state, run the job on one of those idle agents.
If there are agents attached to the "one-gpu-vm" queue, but all of them are occupied:
If another matching bare-metal host is in the "waiting" state, provision it as a VM host as described above and attach its VMs as agents to the "one-gpu-vm" queue.
If all other bare-metal hosts are occupied, add the job to the "one-gpu-vm" queue.
Once there are no jobs remaining on any of a bare-metal host's agents, detach each of its agents from the "one-gpu-vm" queue, destroy the MAAS machine instances for its VMs, release the bare-metal host, and re-attach the bare-metal host's agent to its queues.
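To make that flow concrete, here is a rough, self-contained sketch of the dispatch and teardown logic. It is purely illustrative: the classes, states, and in-memory MagicQueue are placeholders rather than existing Testflinger or MAAS APIs, and a real implementation would drive MAAS and the agents instead of a toy data structure.

```python
"""Rough, self-contained sketch of the "magic queue" dispatch and teardown
logic described above. Everything here is hypothetical: the class names,
states, and helpers are placeholders, not existing Testflinger APIs."""
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    state: str = "waiting"                 # "waiting" or "busy"

@dataclass
class BareMetalHost:
    name: str
    num_gpus: int
    vm_agents: list = field(default_factory=list)

    def __post_init__(self):
        self.agent = Agent(self.name)      # the host's own bare-metal agent

class MagicQueue:
    def __init__(self, name, hosts):
        self.name = name
        self.hosts = hosts                 # bare-metal hosts eligible for splitting
        self.backlog = []                  # jobs waiting for an agent

    def submit(self, job):
        # Case 2: an attached VM agent is idle, so run the job there.
        for host in self.hosts:
            for agent in host.vm_agents:
                if agent.state == "waiting":
                    return self._run(agent, job)
        # Cases 1 and 3a: a matching bare-metal host is idle, so split it.
        for host in self.hosts:
            if not host.vm_agents and host.agent.state == "waiting":
                self._split(host)
                return self._run(host.vm_agents[0], job)
        # Case 3b: everything is busy, so the job waits in the queue.
        self.backlog.append(job)

    def _split(self, host):
        host.agent.state = "busy"          # take the host's own agent down
        # A real implementation would deploy the host in MAAS and run the
        # "VM splitter" here; we simply create one agent per GPU VM.
        host.vm_agents = [Agent(f"{host.name}-vm{i}") for i in range(host.num_gpus)]

    def _run(self, agent, job):
        agent.state = "busy"
        print(f"running {job} on {agent.name}")

    def finish(self, host, agent):
        # Called when a job completes; tear the host down once fully drained.
        agent.state = "waiting"
        if self.backlog:
            return self._run(agent, self.backlog.pop(0))
        if all(a.state == "waiting" for a in host.vm_agents):
            host.vm_agents = []            # detach VM agents, destroy their MAAS VMs,
            host.agent.state = "waiting"   # release the host, re-attach its agent

if __name__ == "__main__":
    queue = MagicQueue("one-gpu-vm", [BareMetalHost("bm-1", num_gpus=8)])
    for n in range(10):
        queue.submit(f"job-{n}")           # 8 run immediately, 2 wait in the backlog
```

In this toy version, the teardown described above happens in finish(): once the last VM agent drains and the backlog is empty, the VM agents are detached and the bare-metal host's own agent goes back to waiting.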
This is very much a wishlist item for a future cycle that would require additional design discussion if it turns out to be viable, so consider this issue more of a request for comment. (Also let me know if there's a more appropriate place for this.) I'd like to know whether, at a high level, the team thinks this is a feature that would make sense and be feasible to implement.