-
Couldn't load subscription status.
- Fork 835
feat: support for managing gpu enabled self runner infra #2762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support for managing gpu enabled self runner infra #2762
Conversation
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Pull Request Test Coverage Report for Build 17456091730Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
|
/ok-to-test |
|
/rerun |
|
/rerun-all |
IIUC, that does not work fine. /retest |
f5326fe to
7373b95
Compare
|
/retest |
0781936 to
55ac00a
Compare
|
/ok-to-test-gpu-runner |
|
/label ok-to-test-gpu-runner |
|
@andreyvelich: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
a2db60c to
b7b69e1
Compare
54205d2 to
37ba59b
Compare
|
Hi @varodrig @Electronic-Waste @astefanutti Can you please review this? |
.github/workflows/test-e2e-gpu.yaml
Outdated
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| kubernetes-version: ["1.33.1"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be the latest as of now 1.34.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
| ) | ||
|
|
||
| # TODO (andreyvelich): Discuss how we want to pre-load runtime images to the Kind cluster. | ||
| TORCH_RUNTIME_IMAGE=pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich is this something we want to keep?
It might be easy to forget to update it when we'll update the runtime over time. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we should parse this information from the kustomize manifests: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml
@jaiakash For your E2Es you don't need Torch runtime, we probably need to build TorchTune trainer, similar to controller manager image: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml#L8-L9.
We can do that in the followup PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will add this to existing issue for enhancement of this.
|
/lgtm Thanks @jaiakash, awesome work! |
|
/cc @varodrig @Electronic-Waste |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates @jaiakash!
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* feat: support for creating and managing gpu cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: makefile bug Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * add: ci action to ask maintainers to add label to when changes are detected Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: fixed issues and cleanup Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: run check on change in pr Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * feat: added seperate workflow for gpu runner Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: deepspeed typo Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: add gpu label on PR without merging Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: merged into single action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fixL run runner as soon as label is added Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: use gpu runner when label exist Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: revert changes and fix script permission Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: create gpu supported gpu Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: nvidia issue Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: gpu cluster and torchtune model Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: notebookpath and delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * tmp fix: notebook to use k8s client Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: use akash sdk and fix notenook size Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: notebook error Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster before creating one and notebook Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: kube config Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: makefile add comment Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: nvidia runtime Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: disable e2e go Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: temporarly use my personal token Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: refactored code Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: take hf token from env of self runner vm Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: to run notebook directly Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * refactor: torchtune job Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: ci action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: pre commit hook Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: rename ci action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * rem: delete cluster command from makefile Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: rem some steps, fixed wait timing and notebook logs according to kubeflow/sdk#83 Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * update: upgrade k8s to 1.34.0 Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> --------- Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
* feat: support for creating and managing gpu cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: makefile bug Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * add: ci action to ask maintainers to add label to when changes are detected Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: fixed issues and cleanup Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: run check on change in pr Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * feat: added seperate workflow for gpu runner Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: deepspeed typo Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: add gpu label on PR without merging Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: merged into single action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fixL run runner as soon as label is added Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: use gpu runner when label exist Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: revert changes and fix script permission Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: create gpu supported gpu Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: nvidia issue Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: gpu cluster and torchtune model Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: notebookpath and delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * tmp fix: notebook to use k8s client Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: use akash sdk and fix notenook size Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: notebook error Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster before creating one and notebook Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: kube config Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: makefile add comment Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: nvidia runtime Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: disable e2e go Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: temporarly use my personal token Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: refactored code Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: take hf token from env of self runner vm Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: to run notebook directly Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * refactor: torchtune job Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: ci action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: pre commit hook Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: rename ci action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * rem: delete cluster command from makefile Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: rem some steps, fixed wait timing and notebook logs according to kubeflow/sdk#83 Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * update: upgrade k8s to 1.34.0 Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> --------- Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
* feat: support for creating and managing gpu cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: makefile bug Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * add: ci action to ask maintainers to add label to when changes are detected Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: fixed issues and cleanup Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: run check on change in pr Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * feat: added seperate workflow for gpu runner Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: deepspeed typo Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: add gpu label on PR without merging Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: merged into single action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fixL run runner as soon as label is added Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: use gpu runner when label exist Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: revert changes and fix script permission Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: create gpu supported gpu Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: nvidia issue Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: gpu cluster and torchtune model Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: notebookpath and delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * tmp fix: notebook to use k8s client Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: use akash sdk and fix notenook size Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: notebook error Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster before creating one and notebook Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: kube config Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: makefile add comment Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: nvidia runtime Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: disable e2e go Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: delete cluster Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: temporarly use my personal token Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: refactored code Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * hotfix: take hf token from env of self runner vm Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: to run notebook directly Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * refactor: torchtune job Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: ci action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * fix: pre commit hook Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: rename ci action Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * rem: delete cluster command from makefile Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * chore: rem some steps, fixed wait timing and notebook logs according to kubeflow/sdk#83 Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> * update: upgrade k8s to 1.34.0 Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> --------- Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com> Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>

Introduction:
This PR is part of Phase 1 of the GSoC 2025 Project 7 – GPU Testing for LLM Blueprints.
1. Support for creating and managing gpu cluster:
This PR adds support for running GPU-enabled workloads using an Oracle Cloud Infrastructure (OCI) GPU-based VM as a self-hosted GitHub Actions runner. The goal is to enable execution of LLM blueprint examples with GPU enabled cluster hosted on Oracle Cloud.
kindwithnvkindto deploy a GPU-capable Kubernetes cluster locally.2. CI Workflow Integration
trainer/examples/directory.Maintainers for this CI Action:
@andreyvelich, @varodrig, @jaiakash
What this PR does / why we need it:
Related #2674 #2432
Checklist: