Background:
PE has developed an internal tool that can easily spawn $(numGPUs) VMs, each with a GPU passed through to it. Each VM is added as a new machine to the same MAAS instance as its host. This is useful when we'd like to parallelize the execution of separate jobs that each require only a single GPU, such as a GPU driver load test on various Ubuntu version/kernel combinations, or a large batch of non-intensive GPU-requiring smoke tests. Previously, PE selected an 8-GPU host to be a permanent VM host, but this ephemeral method is more versatile since it allows us to adjust how many VMs or directly-provisionable bare-metal hosts are available day-to-day, depending on project need.
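For reference, the "splitting" step essentially amounts to defining one network-booting VM per GPU with the GPU passed through, then registering each VM with the host's MAAS instance. The sketch below is only an illustration of that idea, not the actual PE scripts; the bridge name, MAAS profile, VM sizing, and qemu+ssh power address are all placeholder assumptions.

```python
#!/usr/bin/env python3
"""Illustrative sketch only -- the real "VM splitter" scripts are PE-internal.

Assumes libvirt/virt-install on the host, GPUs visible to nvidia-smi, a bridge
named br0 on the MAAS PXE network, and a logged-in `maas` CLI profile named
"admin". All of these names and sizes are placeholders.
"""
import subprocess

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def gpu_pci_addresses():
    # One PCI bus id per GPU; trim nvidia-smi's extended domain prefix
    # ("00000000:3B:00.0" -> "0000:3b:00.0") for virt-install.
    out = run("nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader")
    return [line.strip().lower()[-12:] for line in out.splitlines() if line.strip()]

def spawn_vm(index, pci_addr):
    # Define a PXE-booting VM with a single GPU passed through.
    name = f"gpu-vm-{index}"
    run("virt-install", "--name", name,
        "--memory", "16384", "--vcpus", "8", "--disk", "size=100",
        "--network", "bridge=br0",
        "--hostdev", pci_addr,               # GPU passthrough
        "--pxe", "--boot", "network,hd",
        "--os-variant", "ubuntu22.04",
        "--noautoconsole")
    return name

def vm_mac(name):
    # MAC of the VM's first interface, from `virsh domiflist <domain>`.
    for line in run("virsh", "domiflist", name).splitlines():
        fields = line.split()
        if fields and fields[-1].count(":") == 5:
            return fields[-1]
    raise RuntimeError(f"no interface found for {name}")

def enroll_vm(name, host_fqdn):
    # Register the VM with the same MAAS instance as its host so it can be
    # commissioned and deployed like any other machine.
    run("maas", "admin", "machines", "create",
        f"hostname={name}", "architecture=amd64",
        f"mac_addresses={vm_mac(name)}",
        "power_type=virsh",
        f"power_parameters_power_address=qemu+ssh://ubuntu@{host_fqdn}/system",
        f"power_parameters_power_id={name}")

if __name__ == "__main__":
    for i, pci in enumerate(gpu_pci_addresses()):
        enroll_vm(spawn_vm(i, pci), host_fqdn="vm-host.example.com")
```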
I think this type of "dynamic queue" could be broadly valuable, but today it isn't easily streamlined via Testflinger: we have to run the "VM splitter" scripts after reserving the host via TF, and those scripts create new, non-Testflinger-managed MAAS machine instances for each VM, which then require additional automation to integrate with our infrastructure.
Proposal: "Magic queues"
User story: Consider a "magic queue" named "one-gpu-vm". When a user submits a job to this queue (a rough sketch of this dispatch flow follows the list below):
If no agents are attached to the "one-gpu-vm" queue and a matching bare-metal host is in the "waiting" state, take down the bare-metal host's agent, provision the host in MAAS, then run the "VM splitter" scripts on it. This will enroll $(numGPUs) VMs with GPU passthrough in the same MAAS instance as the host.
After the VMs are commissioned in MAAS, TF adds each VM as an agent to the "one-gpu-vm" queue, allowing the queued job(s) to begin.
If there are agents attached to the "one-gpu-vm" queue in the "waiting" state, run the job on one of those idle agents.
If there are agents attached to the "one-gpu-vm" queue, but all of them are occupied:
If another matching bare-metal host is in the "waiting" state, provision it as a VM host as described above and attach its VMs as agents to the "one-gpu-vm" queue.
If all other bare-metal hosts are occupied, add the job to the "one-gpu-vm" queue.
Once there are no jobs remaining on any of a bare-metal host's agents, detach each of its agents from the "one-gpu-vm" queue, destroy the MAAS machine instances for its VMs, release the bare-metal host, and re-attach the bare-metal host's agent to its queues.
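To make that flow concrete, here is a rough, self-contained sketch of the dispatch and teardown logic. It is purely illustrative: the classes, states, and in-memory MagicQueue are placeholders rather than existing Testflinger or MAAS APIs, and a real implementation would drive MAAS and the agents instead of a toy data structure.

```python
"""Rough, self-contained sketch of the "magic queue" dispatch and teardown
logic described above. Everything here is hypothetical: the class names,
states, and helpers are placeholders, not existing Testflinger APIs."""
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    state: str = "waiting"                 # "waiting" or "busy"

@dataclass
class BareMetalHost:
    name: str
    num_gpus: int
    vm_agents: list = field(default_factory=list)

    def __post_init__(self):
        self.agent = Agent(self.name)      # the host's own bare-metal agent

class MagicQueue:
    def __init__(self, name, hosts):
        self.name = name
        self.hosts = hosts                 # bare-metal hosts eligible for splitting
        self.backlog = []                  # jobs waiting for an agent

    def submit(self, job):
        # Case 2: an attached VM agent is idle, so run the job there.
        for host in self.hosts:
            for agent in host.vm_agents:
                if agent.state == "waiting":
                    return self._run(agent, job)
        # Cases 1 and 3a: a matching bare-metal host is idle, so split it.
        for host in self.hosts:
            if not host.vm_agents and host.agent.state == "waiting":
                self._split(host)
                return self._run(host.vm_agents[0], job)
        # Case 3b: everything is busy, so the job waits in the queue.
        self.backlog.append(job)

    def _split(self, host):
        host.agent.state = "busy"          # take the host's own agent down
        # A real implementation would deploy the host in MAAS and run the
        # "VM splitter" here; we simply create one agent per GPU VM.
        host.vm_agents = [Agent(f"{host.name}-vm{i}") for i in range(host.num_gpus)]

    def _run(self, agent, job):
        agent.state = "busy"
        print(f"running {job} on {agent.name}")

    def finish(self, host, agent):
        # Called when a job completes; tear the host down once fully drained.
        agent.state = "waiting"
        if self.backlog:
            return self._run(agent, self.backlog.pop(0))
        if all(a.state == "waiting" for a in host.vm_agents):
            host.vm_agents = []            # detach VM agents, destroy their MAAS VMs,
            host.agent.state = "waiting"   # release the host, re-attach its agent

if __name__ == "__main__":
    queue = MagicQueue("one-gpu-vm", [BareMetalHost("bm-1", num_gpus=8)])
    for n in range(10):
        queue.submit(f"job-{n}")           # 8 run immediately, 2 wait in the backlog
```

In this toy version, the teardown described above happens in finish(): once the last VM agent drains and the backlog is empty, the VM agents are detached and the bare-metal host's own agent goes back to waiting.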
This is very much a wishlist item for a future cycle that would require additional design discussion if it turns out to be viable, so consider this issue more of a request for comment. (Also let me know if there's a more appropriate place for this.) I'd like to know whether, at a high level, the team thinks this is a feature that would make sense and be feasible to implement.