RFC / Feature suggestion - "Magic queues" #462

Open
MitchellAugustin opened this issue Feb 13, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@MitchellAugustin
Contributor

Background:

  • PE has developed an internal tool that can easily spawn $(numGPUs) VMs, each with a GPU passed through to it. Each VM is added as a new machine to the same MAAS instance as its host. This is useful when we'd like to parallelize the execution of separate jobs that each require only a single GPU, such as a GPU driver load test across various Ubuntu version/kernel combinations, or a large batch of non-intensive GPU-requiring smoke tests. Previously, PE dedicated an 8-GPU host as a permanent VM host, but this ephemeral method is more versatile since it lets us adjust how many VMs or directly-provisionable bare-metal hosts are available day-to-day, depending on project need.
    • I think this type of "dynamic queue" could be broadly valuable, but today it is not easily streamlined via Testflinger: we need to run the "VM splitter" scripts after reserving the host via TF, which creates new, non-Testflinger-managed MAAS machine instances for each VM, and those in turn require additional automation to integrate with our infrastructure. (A sketch of the kind of parallel fan-out we're after follows this list.)
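To make the fan-out concrete, here is a minimal sketch of submitting one single-GPU job per Ubuntu release to such a queue. It assumes the testflinger-cli `submit` subcommand and the `job_queue`/`provision_data`/`test_data` job keys; the queue name, release list, and test command are purely illustrative and not part of this proposal.

```python
# Hypothetical fan-out sketch: one single-GPU job per Ubuntu release, all
# submitted to a "one-gpu-vm" queue. Queue name, releases, and test command
# are illustrative only.
import subprocess
import tempfile

import yaml

RELEASES = ["focal", "jammy", "noble"]  # example version/kernel matrix

for release in RELEASES:
    job = {
        "job_queue": "one-gpu-vm",
        "provision_data": {"distro": release},
        "test_data": {"test_cmds": "sudo ubuntu-drivers install && nvidia-smi"},
    }
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as job_file:
        yaml.safe_dump(job, job_file)
    # Assumes the testflinger-cli "submit" subcommand; each job then waits in
    # the queue until a one-gpu-vm agent picks it up.
    subprocess.run(["testflinger-cli", "submit", job_file.name], check=True)
```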

Proposal: "Magic queues"

  • User story: Consider a "magic queue" named "one-gpu-vm". When a user submits a job to this queue (a rough dispatch sketch follows this list):
    • If no agents are attached to the "one-gpu-vm" queue and a matching bare-metal host is in "waiting" state, take down the bare-metal host's agent, provision the host in MAAS, then run the "VM splitter" scripts on it. This enrolls $(numGPUs) VMs with GPU passthrough in the same MAAS instance as the host.
      • After the VMs are commissioned in MAAS, TF adds each VM as an agent to the "one-gpu-vm" queue, allowing the queued job(s) to begin.
    • If there are agents attached to the "one-gpu-vm" queue in waiting state, run the job on one of the idle agents.
    • If there are agents attached to the "one-gpu-vm" queue, but all of them are occupied:
      • If another matching bare-metal host is in waiting state, provision it as a VM host as described above and attach its VMs as agents to the "one-gpu-vm" queue.
      • If all other bare-metal hosts are also occupied, add the job to the "one-gpu-vm" queue.
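At a high level, the dispatch flow above could be modelled roughly as below. This is only a toy, self-contained simulation of the decision logic; none of the classes or functions here are real Testflinger or MAAS APIs.

```python
# Toy simulation of the proposed "magic queue" dispatch logic. Nothing here
# is a real Testflinger or MAAS API; it only models the decision flow.
from dataclasses import dataclass, field

MAGIC_QUEUE = "one-gpu-vm"

@dataclass
class Agent:
    name: str
    busy: bool = False

@dataclass
class BareMetalHost:
    name: str
    num_gpus: int
    waiting: bool = True                  # "waiting" == free to be provisioned
    vm_agents: list = field(default_factory=list)

def provision_as_vm_host(host: BareMetalHost) -> list[Agent]:
    """Stand-in for: detach host agent, provision in MAAS, run VM splitter."""
    host.waiting = False
    host.vm_agents = [Agent(f"{host.name}-vm{i}") for i in range(host.num_gpus)]
    return host.vm_agents

def dispatch(job: str, queue_agents: list[Agent],
             hosts: list[BareMetalHost], backlog: list[str]) -> None:
    idle = [a for a in queue_agents if not a.busy]
    if idle:
        idle[0].busy = True               # an agent is waiting: run the job there
        print(f"{job}: running on {idle[0].name}")
        return
    free_hosts = [h for h in hosts if h.waiting]
    if free_hosts:
        new_agents = provision_as_vm_host(free_hosts[0])
        queue_agents.extend(new_agents)   # attach the new VM agents to the queue
        new_agents[0].busy = True
        print(f"{job}: running on {new_agents[0].name}")
        return
    backlog.append(job)                   # everything is busy: queue the job
    print(f"{job}: queued on {MAGIC_QUEUE}")

# Example: two 2-GPU hosts, no agents attached yet, six jobs submitted.
agents, backlog = [], []
hosts = [BareMetalHost("host-a", 2), BareMetalHost("host-b", 2)]
for n in range(6):
    dispatch(f"job-{n}", agents, hosts, backlog)
```

In this toy run, job-0 triggers provisioning of host-a, job-2 triggers host-b, and jobs 4 and 5 land in the backlog because all agents and all matching hosts are occupied.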

Once there are no jobs remaining on any of a bare-metal host's VM agents, detach each of those agents from the "one-gpu-vm" queue, destroy the MAAS machine instances for its VMs, release the bare-metal host, and re-attach the bare-metal host's own agent to its original queues.
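The teardown half, continuing the same toy model from the dispatch sketch above, might look roughly like this (again, no real Testflinger or MAAS calls):

```python
# Teardown half of the same toy model (reuses Agent / BareMetalHost above).
def maybe_release_vm_host(host: BareMetalHost, queue_agents: list) -> None:
    if not host.vm_agents:
        return                              # host is not currently split into VMs
    if any(agent.busy for agent in host.vm_agents):
        return                              # jobs still running; check again later
    # No jobs left on any of this host's VM agents: tear everything down.
    for agent in host.vm_agents:
        queue_agents.remove(agent)          # detach from the "one-gpu-vm" queue
    host.vm_agents = []                     # destroy the VMs' MAAS machine records
    host.waiting = True                     # release the host; its own agent goes
                                            # back onto its original queues
```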

This is very much a wishlist item for a future cycle and would require additional design discussion if it proves viable, so consider this issue more of a request for comment. (Also let me know if there's a more appropriate place for this.) I'd like to know whether, at a high level, the team thinks this feature would make sense and be feasible to implement.


Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/CERTTF-502.

This message was autogenerated

@pedro-avalos added the enhancement label on Feb 20, 2025