@jaiakash
Member

jaiakash commented Jul 31, 2025

Introduction:

This PR is part of Phase 1 of the GSoC 2025 Project 7 – GPU Testing for LLM Blueprints.

1. Support for creating and managing a GPU cluster:

This PR adds support for running GPU-enabled workloads using an Oracle Cloud Infrastructure (OCI) GPU-based VM as a self-hosted GitHub Actions runner. The goal is to enable execution of LLM blueprint examples on a GPU-enabled cluster hosted on Oracle Cloud.

  • Replaces kind with nvkind to deploy a GPU-capable Kubernetes cluster locally.
  • Configures the container runtime to use the NVIDIA Container Runtime for GPU scheduling.
  • Installs the NVIDIA GPU Operator via Helm, which provisions:
    • NVIDIA Drivers
    • Kubernetes Device Plugin
    • DCGM Exporter for GPU telemetry
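
Once the Device Plugin from the list above is running, GPU nodes are expected to advertise the `nvidia.com/gpu` extended resource. The following is a minimal sketch, not part of this PR, of how that could be checked from node data shaped like the output of `kubectl get nodes -o json`. The function name `count_allocatable_gpus` and the sample node names are hypothetical.

```python
import json

# Extended resource name that the NVIDIA Kubernetes device plugin advertises.
GPU_RESOURCE = "nvidia.com/gpu"


def count_allocatable_gpus(node_list):
    """Return {node name: allocatable GPU count} for a Kubernetes NodeList dict."""
    counts = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities are serialized as strings, for example "1".
        counts[node["metadata"]["name"]] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


# Hypothetical NodeList, trimmed to the fields used above.
SAMPLE_NODES = json.loads("""
{
  "items": [
    {"metadata": {"name": "control-plane"},
     "status": {"allocatable": {"cpu": "8"}}},
    {"metadata": {"name": "gpu-worker"},
     "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}}
  ]
}
""")

if __name__ == "__main__":
    print(count_allocatable_gpus(SAMPLE_NODES))
```

In a CI step, the same function could be fed the parsed output of `kubectl get nodes -o json` and the job failed if no node reports a non-zero count.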

2. CI Workflow Integration

  • Adds a GitHub Actions workflow that triggers only when changes are made in the trainer/examples/ directory.
  • The job runs only after approval from one of the maintainers listed below.

Maintainers for this CI Action:
@andreyvelich, @varodrig, @jaiakash
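
The two conditions above (a change under the examples directory, plus maintainer approval) can be sketched as a small predicate. This is an illustration only, assuming approval is expressed as the `ok-to-test-gpu-runner` PR label that appears later in this conversation; the function name `should_run_gpu_e2e` and the `examples/` prefix relative to the repository root are assumptions, and the actual workflow may express the gate differently.

```python
# Label name taken from the comments in this PR; treated here as the approval signal.
GPU_LABEL = "ok-to-test-gpu-runner"
# Assumption: the examples directory sits at the repository root.
WATCHED_PREFIX = "examples/"


def should_run_gpu_e2e(changed_files, pr_labels, label=GPU_LABEL, prefix=WATCHED_PREFIX):
    """Return True only if the PR touches the watched path and carries the approval label."""
    touches_examples = any(path.startswith(prefix) for path in changed_files)
    approved = label in pr_labels
    return touches_examples and approved


if __name__ == "__main__":
    # Hypothetical changed-file list and label set.
    files = ["examples/torchtune/notebook.ipynb", "README.md"]
    print(should_run_gpu_e2e(files, ["ok-to-test-gpu-runner"]))
    print(should_run_gpu_e2e(files, []))
```

Keeping both checks in one predicate makes it explicit that neither a path match nor a label alone is enough to occupy the GPU runner.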


[Image: diagram-export-7-31-2025-6_39_09-PM]

What this PR does / why we need it:
Related #2674 #2432

Checklist:

  • Docs included if any changes are user facing

@coveralls

coveralls commented Jul 31, 2025

Pull Request Test Coverage Report for Build 17456091730

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 52.136%

Totals Coverage Status
Change from base Build 17414373409: 0.0%
Covered Lines: 1025
Relevant Lines: 1966

💛 - Coveralls

@andreyvelich
Member

/ok-to-test

@andreyvelich
Member

/rerun

@andreyvelich
Member

/rerun-all

@tenzen-y
Member

/rerun-all

IIUC, that does not work fine.

/retest

@tenzen-y
Member

tenzen-y commented Jul 31, 2025

/rerun-all

IIUC, that does not work fine.

/retest

Alright, this /retest could work well

[Image: Screenshot 2025-07-31 at 23 54 55]

jaiakash force-pushed the support-for-gpu-cluster-using-oci-runner branch from f5326fe to 7373b95 August 2, 2025 08:22

@andreyvelich
Member

/retest

jaiakash marked this pull request as ready for review August 13, 2025 10:30
jaiakash force-pushed the support-for-gpu-cluster-using-oci-runner branch from 0781936 to 55ac00a August 13, 2025 10:32

@andreyvelich
Member

/ok-to-test-gpu-runner

@andreyvelich
Member

/label ok-to-test-gpu-runner

@google-oss-prow

@andreyvelich: The label(s) /label ok-to-test-gpu-runner cannot be applied. These labels are supported: tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, lifecycle/needs-triage. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label ok-to-test-gpu-runner

@jaiakash
Member Author

jaiakash commented Sep 3, 2025

Hi @varodrig @Electronic-Waste @astefanutti Can you please review this?

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["1.33.1"]
Contributor

Can this be the latest version as of now, 1.34.0?

Member Author

Done

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
google-oss-prow bot removed the lgtm label Sep 4, 2025
)

# TODO (andreyvelich): Discuss how we want to pre-load runtime images to the Kind cluster.
TORCH_RUNTIME_IMAGE=pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
Contributor

@andreyvelich is this something we want to keep?
It might be easy to forget to update it when we update the runtime over time. WDYT?

Member

Ideally, we should parse this information from the kustomize manifests: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml
@jaiakash For your E2Es you don't need the Torch runtime; we probably need to build the TorchTune trainer image, similar to the controller manager image: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml#L8-L9.
We can do that in follow-up PRs.
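
As a rough illustration of the parsing idea in the comment above, the image reference could be derived from the `images:` entries of a kustomization file instead of being hard-coded. This is a stdlib-only sketch under stated assumptions: the sample file content and the image name `example.io/trainer/torchtune-trainer` are hypothetical and do not reflect the actual kustomization.yaml, and the helper names `read_image_overrides` and `resolve_image` are invented here. A real implementation would more likely use a proper YAML parser.

```python
import re


def read_image_overrides(kustomization_text):
    """Collect the entries of the top-level `images:` list as dicts (simplified parser)."""
    entries = []
    in_images = False
    for raw in kustomization_text.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not line.startswith((" ", "-")):
            # A new top-level key starts; remember whether it is `images`.
            in_images = line.split(":", 1)[0] == "images"
            continue
        if not in_images:
            continue
        match = re.match(r"\s*(-\s*)?(\w+):\s*(\S+)", line)
        if not match:
            continue
        starts_item, key, value = match.groups()
        if starts_item or not entries:
            entries.append({})  # a leading "-" opens a new list item
        entries[-1][key] = value.strip("\"'")
    return entries


def resolve_image(entries, name):
    """Build `repo:tag` for the entry matching `name`, or None if it is absent."""
    for entry in entries:
        if entry.get("name") == name:
            repo = entry.get("newName", name)
            tag = entry.get("newTag")
            return f"{repo}:{tag}" if tag else repo
    return None


# Hypothetical kustomization content; the real file may differ.
SAMPLE_KUSTOMIZATION = """\
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base/runtimes
images:
  - name: example.io/trainer/torchtune-trainer
    newTag: latest
"""

if __name__ == "__main__":
    overrides = read_image_overrides(SAMPLE_KUSTOMIZATION)
    print(resolve_image(overrides, "example.io/trainer/torchtune-trainer"))
```

Only the `name`, `newName` and `newTag` fields of kustomize image entries are considered; digests and nested structures are out of scope for this sketch.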

Member Author

Sure, I will add this to the existing enhancement issue.

jaiakash requested a review from astefanutti September 4, 2025 15:18

@astefanutti
Contributor

/lgtm

Thanks @jaiakash, awesome work!

@jaiakash
Member Author

jaiakash commented Sep 4, 2025

/cc @varodrig @Electronic-Waste
Can you please review this?

Member

andreyvelich left a comment

Thanks for the updates @jaiakash!
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

google-oss-prow bot merged commit b9f0602 into kubeflow:master Sep 4, 2025
28 checks passed
google-oss-prow bot added this to the v2.1 milestone Sep 4, 2025
jaiakash deleted the support-for-gpu-cluster-using-oci-runner branch September 4, 2025 15:41
tdn21 pushed a commit to tdn21/trainer that referenced this pull request Sep 6, 2025
* feat: support for creating and managing gpu cluster

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: makefile bug

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* add: ci action to ask maintainers to add label when changes are detected

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* chore: fixed issues and cleanup

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: run check on change in pr

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* feat: added separate workflow for gpu runner

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: deepspeed typo

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* hotfix: add gpu label on PR without merging

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* chore: merged into single action

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: run runner as soon as label is added

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: use gpu runner when label exist

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: revert changes and fix script permission

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: create gpu supported gpu

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: nvidia issue

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: gpu cluster and torchtune model

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: notebookpath and delete cluster

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* tmp fix: notebook to use k8s client

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: use akash sdk and fix notebook size

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: notebook error

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: delete cluster before creating one and notebook

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: kube config

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: makefile add comment

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: nvidia runtime

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* hotfix: disable e2e go

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: delete cluster

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: delete cluster

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* hotfix: temporarily use my personal token

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* chore: refactored code

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* hotfix: take hf token from env of self runner vm

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: to run notebook directly

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* refactor: torchtune job

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: ci action

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* fix: pre commit hook

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* chore: rename ci action

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* rem: delete cluster command from makefile

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* chore: rem some steps, fixed wait timing and notebook logs according to kubeflow/sdk#83

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* update: upgrade k8s to 1.34.0

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

---------

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Tarun Duhan <itarunduhan@gmail.com>
mahdikhashan pushed a commit to mahdikhashan/trainer that referenced this pull request Oct 4, 2025
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>