feat: support for managing gpu enabled self runner infra #2762

jaiakash · 2025-07-31T13:24:02Z

Introduction:

This PR is part of Phase 1 of the GSoC 2025 Project 7 – GPU Testing for LLM Blueprints.

1. Support for creating and managing a GPU cluster:

This PR adds support for running GPU-enabled workloads using an Oracle Cloud Infrastructure (OCI) GPU-based VM as a self-hosted GitHub Actions runner. The goal is to run the LLM blueprint examples on a GPU-enabled cluster hosted on Oracle Cloud.

- Replaces kind with nvkind to deploy a GPU-capable Kubernetes cluster locally.
- Configures the container runtime to use the NVIDIA Container Runtime for GPU scheduling.
- Installs the NVIDIA GPU Operator via Helm, which provisions:
  - NVIDIA Drivers
  - Kubernetes Device Plugin
  - DCGM Exporter for GPU telemetry
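
For readers who want to see the order of operations, the three steps above can be sketched as a list of commands. This is only an illustration, not the contents of the PR's setup script: the commands follow the publicly documented nvkind and GPU Operator quick-start steps as far as I know them, the function name `gpu_cluster_setup_plan` is hypothetical, and nothing is executed.

```python
import shlex
from typing import List


def gpu_cluster_setup_plan() -> List[List[str]]:
    """Return the commands, in order, for a local GPU-capable cluster.

    Assumption: Docker is the host container runtime and the NVIDIA
    Container Toolkit is already installed on the runner VM.
    """
    return [
        # Make the NVIDIA runtime the default so cluster nodes can see the GPUs.
        ["sudo", "nvidia-ctk", "runtime", "configure",
         "--runtime=docker", "--set-as-default"],
        ["sudo", "systemctl", "restart", "docker"],
        # Create the cluster with nvkind instead of kind.
        ["nvkind", "cluster", "create"],
        # Install the NVIDIA GPU Operator from NVIDIA's Helm repository.
        ["helm", "repo", "add", "nvidia", "https://helm.ngc.nvidia.com/nvidia"],
        ["helm", "repo", "update"],
        ["helm", "install", "--wait", "--generate-name",
         "-n", "gpu-operator", "--create-namespace", "nvidia/gpu-operator"],
    ]


if __name__ == "__main__":
    # Print the plan so it can be reviewed before anything is run by hand.
    for command in gpu_cluster_setup_plan():
        print(shlex.join(command))
```

Returning the commands rather than running them keeps the sketch safe to execute on a machine without a GPU.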

2. CI Workflow Integration

Adds a GitHub Actions workflow that triggers only when changes are made in the trainer/examples/ directory.
The job runs only after approval from one of the maintainers listed below.

Maintainers for this CI Action:
@andreyvelich, @varodrig, @jaiakash
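
The gating rule described in this section (run only when examples change, and only after a maintainer approves) can be modelled in a few lines. This is a sketch of the decision logic only, not the workflow file itself. It assumes approval is expressed through the `ok-to-test-gpu-runner` label that appears later in this thread, and it takes the path prefix literally from the description; the function name and parameters are hypothetical.

```python
from typing import Iterable


def should_run_gpu_e2e(
    changed_files: Iterable[str],
    pr_labels: Iterable[str],
    watched_prefix: str = "trainer/examples/",      # assumption: as worded in the description
    approval_label: str = "ok-to-test-gpu-runner",  # assumption: label used for approval
) -> bool:
    """Return True only if the PR touches the examples and has been approved."""
    touches_examples = any(path.startswith(watched_prefix) for path in changed_files)
    approved = approval_label in set(pr_labels)
    return touches_examples and approved


if __name__ == "__main__":
    files = ["trainer/examples/torchtune/example.ipynb", "README.md"]
    print(should_run_gpu_e2e(files, ["ok-to-test"]))
    print(should_run_gpu_e2e(files, ["ok-to-test", "ok-to-test-gpu-runner"]))
```

Requiring both conditions keeps the GPU runner idle for unrelated changes and for PRs that no maintainer has looked at yet.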

What this PR does / why we need it:
Related #2674 #2432

Checklist:

Docs included if any changes are user facing

review-notebook-app · 2025-07-31T13:24:06Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

coveralls · 2025-07-31T14:10:03Z

Pull Request Test Coverage Report for Build 17456091730

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 52.136%

Totals
Change from base Build 17414373409:	0.0%
Covered Lines:	1025
Relevant Lines:	1966

💛 - Coveralls

andreyvelich · 2025-07-31T14:32:58Z

/ok-to-test

andreyvelich · 2025-07-31T14:33:16Z

/rerun

andreyvelich · 2025-07-31T14:33:39Z

/rerun-all

tenzen-y · 2025-07-31T14:54:16Z

/rerun-all

IIUC, that does not work fine.

/retest

tenzen-y · 2025-07-31T14:55:57Z

/rerun-all

IIUC, that does not work fine.

/retest

Alright, this /retest could work well

andreyvelich · 2025-08-05T23:46:54Z

/retest

andreyvelich · 2025-08-13T14:32:34Z

/ok-to-test-gpu-runner

andreyvelich · 2025-08-13T14:33:22Z

/label ok-to-test-gpu-runner

google-oss-prow · 2025-08-13T14:33:25Z

@andreyvelich: The label(s) /label ok-to-test-gpu-runner cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label ok-to-test-gpu-runner

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

jaiakash · 2025-09-03T21:54:23Z

Hi @varodrig @Electronic-Waste @astefanutti Can you please review this?

astefanutti · 2025-09-04T06:36:17Z

.github/workflows/test-e2e-gpu.yaml

+    strategy:
+      fail-fast: false
+      matrix:
+        kubernetes-version: ["1.33.1"]


Can this be the latest version as of now, 1.34.0?

astefanutti · 2025-09-04T07:36:03Z

hack/e2e-setup-gpu-cluster.sh

+)
+
+# TODO (andreyvelich): Discuss how we want to pre-load runtime images to the Kind cluster.
+TORCH_RUNTIME_IMAGE=pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime


@andreyvelich is this something we want to keep?
It might be easy to forget to update it when we update the runtime over time. WDYT?

andreyvelich

Ideally, we should parse this information from the kustomize manifests: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml
@jaiakash For your E2Es you don't need the Torch runtime; we probably need to build the TorchTune trainer image, similar to the controller manager image: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml#L8-L9.
We can do that in follow-up PRs.

jaiakash

Sure, I will add this to the existing issue as an enhancement.
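
The idea raised in this thread, deriving the runtime image from the kustomize manifests instead of hard-coding it in the script, could look roughly like the sketch below. It is not part of the PR. The sample kustomization content and the helper names are hypothetical, and the hand-rolled parser only understands a simple `images:` list; a real implementation would more likely use a proper YAML parser or kustomize itself.

```python
from typing import Dict, List

# Hypothetical sample; the real kustomization.yaml is not reproduced in this thread.
SAMPLE = """\
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/runtimes
images:
  - name: example.com/hypothetical/torch-runtime
    newTag: "1.2.3"
"""


def parse_kustomize_images(text: str) -> List[Dict[str, str]]:
    """Collect the entries of a top-level `images:` list (minimal parser)."""
    images: List[Dict[str, str]] = []
    in_images = False
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].rstrip()  # drop trailing comments
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line.startswith((" ", "-")):
            # A top-level key either opens or closes the images block.
            in_images = line.strip() == "images:"
            continue
        if not in_images:
            continue
        entry = line.strip()
        if entry.startswith("- "):
            images.append({})  # a new list item starts
            entry = entry[2:].strip()
        if ":" in entry and images:
            key, value = entry.split(":", 1)
            images[-1][key.strip()] = value.strip().strip("\"'")
    return images


def image_ref(image: Dict[str, str]) -> str:
    """Build `name:tag`, honouring the newName/newTag overrides if present."""
    name = image.get("newName", image["name"])
    tag = image.get("newTag")
    return f"{name}:{tag}" if tag else name


if __name__ == "__main__":
    for image in parse_kustomize_images(SAMPLE):
        print(image_ref(image))
```

With something like this, the setup script would pick up image changes from the manifests automatically instead of relying on a manually updated variable.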

astefanutti · 2025-09-04T15:22:28Z

/lgtm

Thanks @jaiakash, awesome work!

jaiakash · 2025-09-04T15:30:27Z

/cc @varodrig @Electronic-Waste
Can you please review this?

andreyvelich

Thanks for the updates @jaiakash!
/approve

google-oss-prow · 2025-09-04T15:32:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* feat: support for creating and managing gpu cluster
* fix: makefile bug
* add: ci action to ask maintainers to add label to when changes are detected
* chore: fixed issues and cleanup
* fix: run check on change in pr
* feat: added separate workflow for gpu runner
* fix: deepspeed typo
* hotfix: add gpu label on PR without merging
* chore: merged into single action
* fix: run runner as soon as label is added
* fix: use gpu runner when label exist
* fix: revert changes and fix script permission
* fix: create gpu supported gpu
* fix: nvidia issue
* fix: gpu cluster and torchtune model
* fix: notebookpath and delete cluster
* tmp fix: notebook to use k8s client
* fix: use akash sdk and fix notebook size
* fix: notebook error
* fix: delete cluster before creating one and notebook
* fix: kube config
* fix: makefile add comment
* fix: nvidia runtime
* hotfix: disable e2e go
* fix: delete cluster
* fix: delete cluster
* hotfix: temporarily use my personal token
* chore: refactored code
* hotfix: take hf token from env of self runner vm
* fix: to run notebook directly
* refactor: torchtune job
* fix: ci action
* fix: pre commit hook
* chore: rename ci action
* rem: delete cluster command from makefile
* chore: rem some steps, fixed wait timing and notebook logs according to kubeflow/sdk#83
* update: upgrade k8s to 1.34.0

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>

google-oss-prow bot added the do-not-merge/work-in-progress label Jul 31, 2025

google-oss-prow bot requested review from astefanutti and kuizhiqing July 31, 2025 13:24

google-oss-prow bot added the size/L label Jul 31, 2025

google-oss-prow bot added the ok-to-test label Jul 31, 2025

jaiakash force-pushed the support-for-gpu-cluster-using-oci-runner branch from f5326fe to 7373b95 Compare August 2, 2025 08:22

andreyvelich mentioned this pull request Aug 5, 2025

feat: run workflows on /ok-to-test label #2639

Merged

1 task

jaiakash marked this pull request as ready for review August 13, 2025 10:30

google-oss-prow bot removed the do-not-merge/work-in-progress label Aug 13, 2025

jaiakash force-pushed the support-for-gpu-cluster-using-oci-runner branch from 0781936 to 55ac00a Compare August 13, 2025 10:32

andreyvelich added ok-to-test-gpu-runner and removed ok-to-test-gpu-runner labels Aug 13, 2025

jaiakash force-pushed the support-for-gpu-cluster-using-oci-runner branch from a2db60c to b7b69e1 Compare August 13, 2025 15:45

andreyvelich added the ok-to-test-gpu-runner label Aug 13, 2025

jaiakash force-pushed the support-for-gpu-cluster-using-oci-runner branch from 54205d2 to 37ba59b Compare August 14, 2025 14:50

andreyvelich added ok-to-test-gpu-runner and removed ok-to-test-gpu-runner labels Aug 14, 2025

google-oss-prow bot added the lgtm label Sep 3, 2025

jaiakash mentioned this pull request Sep 3, 2025

fix: gpu e2e test to run on pull_request_target and use kubeflow/trainer secret #2814

Closed

astefanutti reviewed Sep 4, 2025

update: upgrade k8s to 1.34.0

158294f

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

google-oss-prow bot removed the lgtm label Sep 4, 2025

astefanutti reviewed Sep 4, 2025

jaiakash requested a review from astefanutti September 4, 2025 15:18

google-oss-prow bot assigned astefanutti Sep 4, 2025

google-oss-prow bot added the lgtm label Sep 4, 2025

google-oss-prow bot requested review from Electronic-Waste and varodrig September 4, 2025 15:30

andreyvelich reviewed Sep 4, 2025

google-oss-prow bot added the approved label Sep 4, 2025

google-oss-prow bot merged commit b9f0602 into kubeflow:master Sep 4, 2025
28 checks passed

google-oss-prow bot added this to the v2.1 milestone Sep 4, 2025

jaiakash deleted the support-for-gpu-cluster-using-oci-runner branch September 4, 2025 15:41

This was referenced Sep 4, 2025

feat: add HF token and allow gpu workflow to run from pull request #2817

Closed

feat: add HF token and allow gpu workflow to run from pull request target #2818

Merged

This was referenced Sep 10, 2025

[GSoC] Project 7: GPU Testing for LLM Blueprints #2674

Open

fix: read only permission for PRs #2827

Merged

bug: secure secrets from unauthorized access (avoid pull_request_target misuse) #2828

Closed

This was referenced Sep 18, 2025

Add Qwen2 example to the trainer #2834

Closed

fix: For GPU based E2E test to use Qwen2.5 example #2840

Closed
