Merged
37 commits
ebb17bc
feat: support for creating and managing gpu cluster
jaiakash Aug 2, 2025
fcc01f0
fix: makefile bug
jaiakash Aug 13, 2025
e011c07
add: ci action to ask maintainers to add label to when changes are de…
jaiakash Aug 13, 2025
a32d199
chore: fixed issues and cleanup
jaiakash Aug 13, 2025
6ee2921
fix: run check on change in pr
jaiakash Aug 13, 2025
106e2ff
feat: added separate workflow for gpu runner
jaiakash Aug 13, 2025
3c3f17d
fix: deepspeed typo
jaiakash Aug 14, 2025
f38cef9
hotfix: add gpu label on PR without merging
jaiakash Aug 14, 2025
b0992ae
chore: merged into single action
jaiakash Aug 27, 2025
ccf9d0d
fix: run runner as soon as label is added
jaiakash Aug 27, 2025
a6195cf
fix: use gpu runner when label exist
jaiakash Aug 27, 2025
dc01280
fix: revert changes and fix script permission
jaiakash Aug 29, 2025
7c9ce64
fix: create gpu supported gpu
jaiakash Aug 29, 2025
44030e9
fix: nvidia issue
jaiakash Aug 29, 2025
5790054
fix: gpu cluster and torchtune model
jaiakash Aug 29, 2025
268768f
fix: notebookpath and delete cluster
jaiakash Aug 29, 2025
d3506e7
tmp fix: notebook to use k8s client
jaiakash Aug 30, 2025
c598d87
fix: use akash sdk and fix notebook size
jaiakash Aug 30, 2025
79c1835
fix: notebook error
jaiakash Aug 30, 2025
9404704
fix: delete cluster before creating one and notebook
jaiakash Aug 30, 2025
578e600
fix: kube config
jaiakash Aug 30, 2025
77247e0
fix: makefile add comment
jaiakash Aug 30, 2025
b70bcf2
fix: nvidia runtime
jaiakash Aug 30, 2025
d2a351e
hotfix: disable e2e go
jaiakash Aug 31, 2025
2063417
fix: delete cluster
jaiakash Aug 31, 2025
f85b1ad
fix: delete cluster
jaiakash Aug 31, 2025
27b9c88
hotfix: temporarily use my personal token
jaiakash Aug 31, 2025
ca56b04
chore: refactored code
jaiakash Aug 31, 2025
05cf8cf
hotfix: take hf token from env of self runner vm
jaiakash Aug 31, 2025
bee80a3
fix: to run notebook directly
jaiakash Sep 3, 2025
5486fd2
refactor: torchtune job
jaiakash Sep 3, 2025
f16da84
fix: ci action
jaiakash Sep 3, 2025
a609c41
fix: pre commit hook
jaiakash Sep 3, 2025
605218a
chore: rename ci action
jaiakash Sep 3, 2025
cdd7f0d
rem: delete cluster command from makefile
jaiakash Sep 3, 2025
4b63277
chore: rem some steps, fixed wait timing and notebook logs according …
jaiakash Sep 3, 2025
158294f
update: upgrade k8s to 1.34.0
jaiakash Sep 4, 2025
87 changes: 87 additions & 0 deletions .github/workflows/test-e2e-gpu.yaml
@@ -0,0 +1,87 @@
name: GPU E2E Test

on:
  pull_request:
    types: [opened, reopened, synchronize, labeled]
Comment on lines +4 to +5

Member
@jaiakash Can you please open an issue to track that we should refactor this action to use pull_request_target and get HF_TOKEN from GitHub secrets?

Member Author
Yes
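
For reference, a rough sketch of what that pull_request_target refactor might look like — hypothetical, not the final workflow. It assumes an HF_TOKEN repository secret exists; because pull_request_target runs with access to repository secrets, the real workflow would need the maintainer-applied label gate and an explicit PR-head checkout:

on:
  pull_request_target:
    types: [labeled]

jobs:
  gpu-e2e-test:
    # Gate on the maintainer-applied label before exposing secrets.
    if: contains(github.event.pull_request.labels.*.name, 'ok-to-test-gpu-runner')
    runs-on: oracle-vm-16cpu-a10gpu-240gb
    steps:
      - uses: actions/checkout@v4
        with:
          # pull_request_target checks out the base branch by default,
          # so pin the PR head explicitly.
          ref: ${{ github.event.pull_request.head.sha }}
      - name: Run e2e test on GPU cluster
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}  # assumes an HF_TOKEN repo secret is configured
        run: |
          make test-e2e-setup-gpu-cluster K8S_VERSION=1.34.0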


jobs:
  gpu-e2e-test:
    name: GPU E2E Test
    runs-on: oracle-vm-16cpu-a10gpu-240gb

    env:
      GOPATH: ${{ github.workspace }}/go
    defaults:
      run:
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["1.34.0"]

    steps:
      - name: Check GPU label
Member
Did you push this commit: #2762 (comment)?

Member Author
@jaiakash Sep 3, 2025
Not exactly the same code. exit 0 was exiting only that part of the step, not the entire Check GPU label step. I added a flag too.

Member
Ok, we can keep it for now.

        id: check-label
        run: |
          if [[ "${{ join(github.event.pull_request.labels.*.name, ',') }}" != *"ok-to-test-gpu-runner"* ]]; then
            echo "✅ Skipping GPU E2E tests (label not present)."
            echo "skip=true" >> $GITHUB_OUTPUT
            exit 0
          else
            echo "Label found. Running GPU tests."
            echo "skip=false" >> $GITHUB_OUTPUT
          fi

      - name: Check out code
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/checkout@v4
        with:
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

      - name: Setup Go
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/setup-go@v5
        with:
          go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/go.mod

      - name: Setup Python
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/setup-python@v5
        with:
          python-version: 3.11

      - name: Install dependencies
        if: steps.check-label.outputs.skip == 'false'
        run: |
          pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
          pip install git+https://github.com/kubeflow/sdk.git@main

      - name: Setup cluster with GPU support using nvidia/kind
        if: steps.check-label.outputs.skip == 'false'
        run: |
          make test-e2e-setup-gpu-cluster K8S_VERSION=${{ matrix.kubernetes-version }}

      - name: Run e2e test on GPU cluster
        if: steps.check-label.outputs.skip == 'false'
        run: |
          mkdir -p artifacts/notebooks
          make test-e2e-notebook NOTEBOOK_INPUT=./examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_alpaca-trainjob-yaml.ipynb TIMEOUT=900

      - name: Upload Artifacts to GitHub
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ${{ matrix.kubernetes-version }}
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
          retention-days: 1

  delete-kind-cluster:
    name: Delete kind Cluster
    runs-on: oracle-vm-16cpu-a10gpu-240gb
    needs: [gpu-e2e-test]
    if: always()
    steps:
      - name: Delete any existing kind cluster
        run: |
          sudo kind delete cluster --name kind-gpu && echo "kind cluster has been deleted" || echo "kind cluster doesn't exist"
4 changes: 4 additions & 0 deletions Makefile
@@ -178,6 +178,10 @@ test-python-integration: ## Run Python integration test.
test-e2e-setup-cluster: kind ## Setup Kind cluster for e2e test.
	KIND=$(KIND) K8S_VERSION=$(K8S_VERSION) ./hack/e2e-setup-cluster.sh

.PHONY: test-e2e-setup-gpu-cluster
test-e2e-setup-gpu-cluster: kind ## Setup Kind cluster for GPU e2e test.
	KIND=$(KIND) K8S_VERSION=$(K8S_VERSION) ./hack/e2e-setup-gpu-cluster.sh

.PHONY: test-e2e
test-e2e: ginkgo ## Run Go e2e test.
	$(GINKGO) -v ./test/e2e/...
85 changes: 58 additions & 27 deletions examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb
@@ -38,7 +38,9 @@
"id": "288ec515",
"metadata": {},
"outputs": [],
"source": "!pip install git+https://github.com/kubeflow/sdk.git@main"
"source": [
"!pip install git+https://github.com/kubeflow/sdk.git@main"
]
},
{
"cell_type": "markdown",
@@ -73,6 +75,8 @@
"source": [
"# List all available Kubeflow Training Runtimes.\n",
"from kubeflow.trainer import *\n",
"from kubeflow_trainer_api import models\n",
"import os\n",
"\n",
"client = TrainerClient()\n",
"for runtime in client.list_runtimes():\n",
@@ -154,19 +158,23 @@
],
"source": [
"# Create a PersistentVolumeClaim for the TorchTune Llama 3.2 1B model.\n",
"client.core_api.create_namespaced_persistent_volume_claim(\n",
" namespace=\"default\",\n",
" body=client.V1PersistentVolumeClaim(\n",
" api_version=\"v1\",\n",
" kind=\"PersistentVolumeClaim\",\n",
" metadata=client.V1ObjectMeta(name=\"torchtune-llama3.2-1b\"),\n",
" spec=client.V1PersistentVolumeClaimSpec(\n",
" access_modes=[\"ReadWriteOnce\"],\n",
" resources=client.V1ResourceRequirements(\n",
" requests={\"storage\": \"20Gi\"}\n",
" ),\n",
" ),\n",
" ),\n",
"client.backend.core_api.create_namespaced_persistent_volume_claim(\n",
" namespace=\"default\",\n",
" body=models.IoK8sApiCoreV1PersistentVolumeClaim(\n",
" apiVersion=\"v1\",\n",
" kind=\"PersistentVolumeClaim\",\n",
" metadata=models.IoK8sApimachineryPkgApisMetaV1ObjectMeta(\n",
" name=\"torchtune-llama3.2-1b\"\n",
" ),\n",
" spec=models.IoK8sApiCoreV1PersistentVolumeClaimSpec(\n",
" accessModes=[\"ReadWriteOnce\"],\n",
" resources=models.IoK8sApiCoreV1VolumeResourceRequirements(\n",
" requests={\n",
" \"storage\": models.IoK8sApimachineryPkgApiResourceQuantity(\"200Gi\")\n",
" }\n",
" ),\n",
" ),\n",
" ).to_dict(),\n",
")"
]
},
@@ -188,31 +196,51 @@
"outputs": [],
"source": [
"job_name = client.train(\n",
" runtime=Runtime(\n",
" name=\"torchtune-llama3.2-1b\"\n",
" ),\n",
" runtime=client.get_runtime(name=\"torchtune-llama3.2-1b\"),\n",
" initializer=Initializer(\n",
" dataset=HuggingFaceDatasetInitializer(\n",
" storage_uri=\"hf://tatsu-lab/alpaca/data\"\n",
" ),\n",
" model=HuggingFaceModelInitializer(\n",
" storage_uri=\"hf://meta-llama/Llama-3.2-1B-Instruct\",\n",
" access_token=\"<YOUR_HF_TOKEN>\" # Replace with your Hugging Face token,\n",
" access_token=os.environ[\"HF_TOKEN\"] # Replace with your Hugging Face token,\n",
" )\n",
" ),\n",
" trainer=BuiltinTrainer(\n",
" config=TorchTuneConfig(\n",
" dataset_preprocess_config=TorchTuneInstructDataset(\n",
" source=DataFormat.PARQUET,\n",
" source=DataFormat.PARQUET, split=\"train[:1000]\"\n",
" ),\n",
" resources_per_node={\n",
" \"memory\": \"200G\",\n",
" \"gpu\": 1,\n",
" }\n",
" },\n",
" \n",
" )\n",
" )\n",
")"
]
},
{
"cell_type": "markdown",
"id": "ee5fbe8e",
"metadata": {},
"source": [
"## Wait for running status"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53eaa65a",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Wait for the running status.\n",
"client.wait_for_job_status(name=job_name, status={\"Running\"})\n"
]
},
{
"cell_type": "markdown",
"id": "75a82b76",
@@ -247,8 +275,8 @@
"source": [
"from kubeflow.trainer.constants import constants\n",
"\n",
"log_dict = client.get_job_logs(job_name, follow=False, step=constants.DATASET_INITIALIZER)\n",
"print(log_dict[constants.DATASET_INITIALIZER])"
"for line in client.get_job_logs(job_name, follow=True, step=constants.DATASET_INITIALIZER):\n",
" print(line)"
]
},
{
@@ -279,16 +307,16 @@
}
],
"source": [
"log_dict = client.get_job_logs(job_name, follow=False, step=constants.MODEL_INITIALIZER)\n",
"print(log_dict[constants.MODEL_INITIALIZER])"
"for line in client.get_job_logs(job_name, follow=True, step=constants.MODEL_INITIALIZER):\n",
" print(line)"
]
},
{
"cell_type": "markdown",
"id": "b67775ea",
"metadata": {},
"source": [
"### Trainer Node"
"### Trainer Node "
]
},
{
@@ -392,8 +420,11 @@
}
],
"source": [
"log_dict = client.get_job_logs(job_name, follow=False)\n",
"print(log_dict[f\"{constants.NODE}-0\"])"
"for c in client.get_job(name=job_name).steps:\n",
" print(f\"Step: {c.name}, Status: {c.status}, Devices: {c.device} x {c.device_count}\\n\")\n",
"\n",
"for line in client.get_job_logs(job_name, follow=True):\n",
" print(line)"
]
},
{
124 changes: 124 additions & 0 deletions hack/e2e-setup-gpu-cluster.sh
@@ -0,0 +1,124 @@
#!/usr/bin/env bash

# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script is used to set up a Kind cluster with GPU support for Kubeflow Trainer e2e tests.

set -o errexit
set -o nounset
set -o pipefail
set -x

# Configure variables.
KIND=${KIND:-./bin/kind}
K8S_VERSION=${K8S_VERSION:-1.32.0}
GPU_OPERATOR_VERSION="v25.3.2"
KIND_NODE_VERSION=kindest/node:v${K8S_VERSION}
GPU_CLUSTER_NAME="kind-gpu"
NAMESPACE="kubeflow-system"
TIMEOUT="5m"

# Kubeflow Trainer images.
# TODO (andreyvelich): Support initializers images.
CONTROLLER_MANAGER_CI_IMAGE_NAME="ghcr.io/kubeflow/trainer/trainer-controller-manager"
CONTROLLER_MANAGER_CI_IMAGE_TAG="test"
CONTROLLER_MANAGER_CI_IMAGE="${CONTROLLER_MANAGER_CI_IMAGE_NAME}:${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
echo "Build Kubeflow Trainer images"
sudo docker build . -f cmd/trainer-controller-manager/Dockerfile -t ${CONTROLLER_MANAGER_CI_IMAGE}

# Set up Docker to use NVIDIA runtime.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
sudo systemctl restart docker

# Create a Kind cluster with GPU support.
nvkind cluster create --name ${GPU_CLUSTER_NAME} --image "${KIND_NODE_VERSION}"
nvkind cluster print-gpus

# Install gpu-operator to make sure we can run GPU workloads.
echo "Install NVIDIA GPU Operator"
kubectl create ns gpu-operator
Member
Are you going to refactor this script in a follow-up PR? E.g. we could just re-use this script and create the cluster with nvkind if GPU is used:
CONTROLLER_MANAGER_CI_IMAGE_NAME="ghcr.io/kubeflow/trainer/trainer-controller-manager"
CONTROLLER_MANAGER_CI_IMAGE_TAG="test"
CONTROLLER_MANAGER_CI_IMAGE="${CONTROLLER_MANAGER_CI_IMAGE_NAME}:${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
echo "Build Kubeflow Trainer images"
docker build . -f cmd/trainer-controller-manager/Dockerfile -t ${CONTROLLER_MANAGER_CI_IMAGE}

Member Author
Yes @andreyvelich. In the future, as discussed, I will merge this into a single e2e-setup-cluster.sh based on a flag.

Member
@jaiakash Please create an issue and assign it to yourself to track it.

Member Author
@jaiakash Sep 2, 2025
Sure, I will create an issue for this.

By the way, can you please review issue: #2809
PR: #2810

Member Author
@jaiakash Sep 3, 2025
Created an issue to track the above suggestion: #2812
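
A minimal sketch of that flag-based merge, assuming a hypothetical GPU_ENABLED variable (the final design may differ):

# Hypothetical sketch: one e2e-setup-cluster.sh that branches on a flag.
GPU_ENABLED=${GPU_ENABLED:-false}

if [ "${GPU_ENABLED}" = "true" ]; then
  # nvkind wraps kind so that host GPUs are exposed to the cluster nodes.
  nvkind cluster create --name "${CLUSTER_NAME}" --image "${KIND_NODE_VERSION}"
else
  ${KIND} create cluster --name "${CLUSTER_NAME}" --image "${KIND_NODE_VERSION}"
fi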

kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version="${GPU_OPERATOR_VERSION}"

# Validation steps for GPU operator installation
kubectl get ns gpu-operator
kubectl get ns gpu-operator --show-labels | grep pod-security.kubernetes.io/enforce=privileged
helm list -n gpu-operator
kubectl get pods -n gpu-operator -o name | while read pod; do
kubectl wait --for=condition=Ready --timeout=300s "$pod" -n gpu-operator || echo "$pod failed to become Ready"
done
kubectl get pods -n gpu-operator
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu

# Load Kubeflow Trainer images
echo "Load Kubeflow Trainer images"
kind load docker-image "${CONTROLLER_MANAGER_CI_IMAGE}" --name "${GPU_CLUSTER_NAME}"

# Deploy Kubeflow Trainer control plane
echo "Deploy Kubeflow Trainer control plane"
E2E_MANIFESTS_DIR="artifacts/e2e/manifests"
mkdir -p "${E2E_MANIFESTS_DIR}"
cat <<EOF >"${E2E_MANIFESTS_DIR}/kustomization.yaml"
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../manifests/overlays/manager
images:
- name: "${CONTROLLER_MANAGER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
EOF

kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}"

# We should wait until Deployment is in Ready status.
echo "Wait for Kubeflow Trainer to be ready"
(kubectl wait deploy/kubeflow-trainer-controller-manager --for=condition=available -n ${NAMESPACE} --timeout ${TIMEOUT} &&
kubectl wait pods --for=condition=ready -n ${NAMESPACE} --timeout ${TIMEOUT} --all) ||
(
echo "Failed to wait until Kubeflow Trainer is ready" &&
kubectl get pods -n ${NAMESPACE} &&
kubectl describe pods -n ${NAMESPACE} &&
exit 1
)

print_cluster_info() {
  kubectl version
  kubectl cluster-info
  kubectl get nodes
  kubectl get pods -n ${NAMESPACE}
  kubectl describe pod -n ${NAMESPACE}
}

# TODO (andreyvelich): Currently, we print manager logs due to flaky test.
echo "Deploy Kubeflow Trainer runtimes"
kubectl apply --server-side -k manifests/overlays/runtimes || (
kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=trainer &&
print_cluster_info &&
exit 1
)

# TODO (andreyvelich): Discuss how we want to pre-load runtime images to the Kind cluster.
TORCH_RUNTIME_IMAGE=pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
Contributor
@andreyvelich is this something we want to keep?
It might be easy to forget to update it when we update the runtime over time. WDYT?

Member
Ideally, we should parse this information from the kustomize manifests: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml
@jaiakash For your E2Es you don't need the Torch runtime; we probably need to build the TorchTune trainer, similar to the controller manager image: https://github.com/kubeflow/trainer/blob/master/manifests/overlays/runtimes/kustomization.yaml#L8-L9.
We can do that in follow-up PRs.

Member Author
Sure, I will add this to the existing issue for this enhancement.
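
A possible sketch for deriving the runtime images from the manifests rather than hard-coding them (assumes the rendered kubectl kustomize output lists the image: fields; not the final implementation):

# Hypothetical sketch: derive runtime images from the kustomize manifests.
RUNTIME_IMAGES=$(kubectl kustomize manifests/overlays/runtimes |
  grep -E '^ +image: ' | awk '{print $2}' | sort -u)
for image in ${RUNTIME_IMAGES}; do
  docker pull "${image}"
  kind load docker-image "${image}" --name "${GPU_CLUSTER_NAME}"
done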

docker pull ${TORCH_RUNTIME_IMAGE}
kind load docker-image ${TORCH_RUNTIME_IMAGE} --name ${GPU_CLUSTER_NAME}

print_cluster_info