feat: support for managing gpu enabled self runner infra #2762
New workflow file (@@ -0,0 +1,87 @@):

```yaml
name: GPU E2E Test

on:
  pull_request:
    types: [opened, reopened, synchronize, labeled]

jobs:
  gpu-e2e-test:
    name: GPU E2E Test
    runs-on: oracle-vm-16cpu-a10gpu-240gb

    env:
      GOPATH: ${{ github.workspace }}/go
    defaults:
      run:
        working-directory: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

    strategy:
      fail-fast: false
      matrix:
        kubernetes-version: ["1.34.0"]

    steps:
      - name: Check GPU label
        id: check-label
        run: |
          if [[ "${{ join(github.event.pull_request.labels.*.name, ',') }}" != *"ok-to-test-gpu-runner"* ]]; then
            echo "✅ Skipping GPU E2E tests (label not present)."
            echo "skip=true" >> $GITHUB_OUTPUT
            exit 0
          else
            echo "Label found. Running GPU tests."
            echo "skip=false" >> $GITHUB_OUTPUT
          fi

      - name: Check out code
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/checkout@v4
        with:
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer

      - name: Setup Go
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/setup-go@v5
        with:
          go-version-file: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/go.mod

      - name: Setup Python
        if: steps.check-label.outputs.skip == 'false'
        uses: actions/setup-python@v5
        with:
          python-version: 3.11

      - name: Install dependencies
        if: steps.check-label.outputs.skip == 'false'
        run: |
          pip install papermill==2.6.0 jupyter==1.1.1 ipykernel==6.29.5
          pip install git+https://github.com/kubeflow/sdk.git@main

      - name: Setup cluster with GPU support using nvidia/kind
        if: steps.check-label.outputs.skip == 'false'
        run: |
          make test-e2e-setup-gpu-cluster K8S_VERSION=${{ matrix.kubernetes-version }}

      - name: Run e2e test on GPU cluster
        if: steps.check-label.outputs.skip == 'false'
        run: |
          mkdir -p artifacts/notebooks
          make test-e2e-notebook NOTEBOOK_INPUT=./examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb NOTEBOOK_OUTPUT=./artifacts/notebooks/${{ matrix.kubernetes-version }}_alpaca-trainjob-yaml.ipynb TIMEOUT=900

      - name: Upload Artifacts to GitHub
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: ${{ matrix.kubernetes-version }}
          path: ${{ env.GOPATH }}/src/github.com/kubeflow/trainer/artifacts/*
          retention-days: 1

  delete-kind-cluster:
    name: Delete kind Cluster
    runs-on: oracle-vm-16cpu-a10gpu-240gb
    needs: [gpu-e2e-test]
    if: always()
    steps:
      - name: Delete any existing kind cluster
        run: |
          sudo kind delete cluster --name kind-gpu && echo "kind cluster has been deleted" || echo "kind cluster doesn't exist"
```

Review thread on the "Check GPU label" step:

- "Did you push this commit: #2762 (comment)?"
- "Not exactly the same code."
- "Ok, we can keep it for now."
New setup script (@@ -0,0 +1,124 @@):

```bash
#!/usr/bin/env bash

# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This script is used to set up a Kind cluster for Kubeflow Trainer e2e tests.

set -o errexit
set -o nounset
set -o pipefail
set -x

# Configure variables.
KIND=${KIND:-./bin/kind}
K8S_VERSION=${K8S_VERSION:-1.32.0}
GPU_OPERATOR_VERSION="v25.3.2"
KIND_NODE_VERSION=kindest/node:v${K8S_VERSION}
GPU_CLUSTER_NAME="kind-gpu"
NAMESPACE="kubeflow-system"
TIMEOUT="5m"

# Kubeflow Trainer images.
# TODO (andreyvelich): Support initializers images.
CONTROLLER_MANAGER_CI_IMAGE_NAME="ghcr.io/kubeflow/trainer/trainer-controller-manager"
CONTROLLER_MANAGER_CI_IMAGE_TAG="test"
CONTROLLER_MANAGER_CI_IMAGE="${CONTROLLER_MANAGER_CI_IMAGE_NAME}:${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
echo "Build Kubeflow Trainer images"
sudo docker build . -f cmd/trainer-controller-manager/Dockerfile -t ${CONTROLLER_MANAGER_CI_IMAGE}

# Set up Docker to use NVIDIA runtime.
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
sudo nvidia-ctk config --set accept-nvidia-visible-devices-as-volume-mounts=true --in-place
sudo systemctl restart docker

# Create a Kind cluster with GPU support.
nvkind cluster create --name ${GPU_CLUSTER_NAME} --image "${KIND_NODE_VERSION}"
nvkind cluster print-gpus

# Install gpu-operator to make sure we can run GPU workloads.
echo "Install NVIDIA GPU Operator"
kubectl create ns gpu-operator
```
Review thread on this script (at "kubectl create ns gpu-operator"):

- "Are you going to refactor this script in a follow-up PR? E.g. we can just re-use this script, and create the cluster with trainer/hack/e2e-setup-cluster.sh (lines 33 to 37 in 05cf8cf)."
- "Yes @andreyvelich, in future, as discussed, I will merge this into a single e2e-setup-cluster.sh based on a flag."
- "@jaiakash Please create an issue and assign it to yourself to track it."
- "Created issue to track the above suggestion: #2812."
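As a rough illustration of the merge discussed above (a minimal sketch only; the actual refactor is tracked in #2812, and the GPU_ENABLED flag name is an assumption), the combined script could gate the GPU-specific cluster creation behind a flag:

```bash
#!/usr/bin/env bash
# Hypothetical sketch only: a single e2e-setup-cluster.sh that gates GPU-specific
# setup behind a flag (the GPU_ENABLED variable name is an assumption).
set -o errexit -o nounset -o pipefail

GPU_ENABLED=${GPU_ENABLED:-false}
CLUSTER_NAME=${CLUSTER_NAME:-kind}

if [[ "${GPU_ENABLED}" == "true" ]]; then
  # GPU path: configure the NVIDIA container runtime and create the cluster with nvkind.
  sudo nvidia-ctk runtime configure --runtime=docker --set-as-default --cdi.enabled
  sudo systemctl restart docker
  nvkind cluster create --name "${CLUSTER_NAME}"
else
  # CPU path: a plain kind cluster (what hack/e2e-setup-cluster.sh presumably does today).
  kind create cluster --name "${CLUSTER_NAME}"
fi

# Shared steps (build Trainer images, deploy the control plane and runtimes) would follow here.
```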
The script continues:

```bash
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version="${GPU_OPERATOR_VERSION}"

# Validation steps for GPU operator installation
kubectl get ns gpu-operator
kubectl get ns gpu-operator --show-labels | grep pod-security.kubernetes.io/enforce=privileged
helm list -n gpu-operator
kubectl get pods -n gpu-operator -o name | while read pod; do
  kubectl wait --for=condition=Ready --timeout=300s "$pod" -n gpu-operator || echo "$pod failed to become Ready"
done
kubectl get pods -n gpu-operator
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu

# Load Kubeflow Trainer images
echo "Load Kubeflow Trainer images"
kind load docker-image "${CONTROLLER_MANAGER_CI_IMAGE}" --name "${GPU_CLUSTER_NAME}"

# Deploy Kubeflow Trainer control plane
echo "Deploy Kubeflow Trainer control plane"
E2E_MANIFESTS_DIR="artifacts/e2e/manifests"
mkdir -p "${E2E_MANIFESTS_DIR}"
cat <<EOF >"${E2E_MANIFESTS_DIR}/kustomization.yaml"
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../manifests/overlays/manager
images:
- name: "${CONTROLLER_MANAGER_CI_IMAGE_NAME}"
  newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
EOF

kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}"

# We should wait until Deployment is in Ready status.
echo "Wait for Kubeflow Trainer to be ready"
(kubectl wait deploy/kubeflow-trainer-controller-manager --for=condition=available -n ${NAMESPACE} --timeout ${TIMEOUT} &&
  kubectl wait pods --for=condition=ready -n ${NAMESPACE} --timeout ${TIMEOUT} --all) ||
  (
    echo "Failed to wait until Kubeflow Trainer is ready" &&
      kubectl get pods -n ${NAMESPACE} &&
      kubectl describe pods -n ${NAMESPACE} &&
      exit 1
  )

print_cluster_info() {
  kubectl version
  kubectl cluster-info
  kubectl get nodes
  kubectl get pods -n ${NAMESPACE}
  kubectl describe pod -n ${NAMESPACE}
}

# TODO (andreyvelich): Currently, we print manager logs due to flaky test.
echo "Deploy Kubeflow Trainer runtimes"
kubectl apply --server-side -k manifests/overlays/runtimes || (
  kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=trainer &&
    print_cluster_info &&
    exit 1
)

# TODO (andreyvelich): Discuss how we want to pre-load runtime images to the Kind cluster.
TORCH_RUNTIME_IMAGE=pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
```
Review thread on the hardcoded TORCH_RUNTIME_IMAGE:

- "@andreyvelich is this something we want to keep?"
- "Ideally, we should parse this information from the …"
- "Sure, I will add this to the existing issue for enhancement of this."
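If the eventual goal is to derive this image from the rendered runtime manifests rather than hardcoding it, a minimal sketch could look like the following (an assumption, since the comment above is cut off; it also assumes the rendered manifests reference a pytorch/pytorch image):

```bash
# Hypothetical sketch: derive the torch runtime image from the rendered runtime
# manifests instead of hardcoding it (assumes a pytorch/pytorch image is listed there).
TORCH_RUNTIME_IMAGE=$(kubectl kustomize manifests/overlays/runtimes \
  | grep -oE 'pytorch/pytorch:[^"[:space:]]+' \
  | head -n 1)
echo "Detected runtime image: ${TORCH_RUNTIME_IMAGE}"
```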
The script ends by pre-loading the runtime image and printing cluster info:

```bash
docker pull ${TORCH_RUNTIME_IMAGE}
kind load docker-image ${TORCH_RUNTIME_IMAGE} --name ${GPU_CLUSTER_NAME}

print_cluster_info
```
Review thread on the workflow:

- "@jaiakash Please can you open an issue to track that we should refactor this action with pull_request_target and get HF_TOKEN from GitHub secrets?"
- "Yes"
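For reference, a minimal sketch of that direction (hypothetical; it assumes an HF_TOKEN repository secret exists, and pull_request_target runs workflow code from the base branch, so the label gate stays important before checking out PR code):

```yaml
# Hypothetical sketch of the suggested refactor: trigger on pull_request_target so
# repository secrets (e.g. HF_TOKEN) are available, while keeping the label gate.
on:
  pull_request_target:
    types: [opened, reopened, synchronize, labeled]

jobs:
  gpu-e2e-test:
    runs-on: oracle-vm-16cpu-a10gpu-240gb
    # Only run when a maintainer has applied the trusted label.
    if: contains(github.event.pull_request.labels.*.name, 'ok-to-test-gpu-runner')
    env:
      HF_TOKEN: ${{ secrets.HF_TOKEN }}  # assumed repository secret
    steps:
      - uses: actions/checkout@v4
        with:
          # pull_request_target checks out the base branch by default; fetch the PR head explicitly.
          ref: ${{ github.event.pull_request.head.sha }}
```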