
Conversation

astefanutti
Contributor

What this PR does / why we need it:

Support setting up the e2e cluster with Podman in addition to Docker.
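
For context, kind already supports Podman through the KIND_EXPERIMENTAL_PROVIDER environment variable, so engine detection can stay small. A minimal sketch of that idea, assuming a simple fallback order (the CONTAINER_ENGINE variable and the cluster name are illustrative, not necessarily what this PR uses):

# Prefer Docker if it is installed, otherwise fall back to Podman.
if command -v docker >/dev/null 2>&1; then
  CONTAINER_ENGINE=docker
elif command -v podman >/dev/null 2>&1; then
  CONTAINER_ENGINE=podman
  # kind selects Podman via this (experimental) provider switch.
  export KIND_EXPERIMENTAL_PROVIDER=podman
else
  echo "neither docker nor podman found" >&2
  exit 1
fi

kind create cluster --name trainer-e2e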

Checklist:

  • Docs included if any changes are user facing

@coveralls

coveralls commented Sep 29, 2025

Pull Request Test Coverage Report for Build 18186567965

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.174%

Totals Coverage Status

  • Change from base Build 18045194431: 0.0%
  • Covered Lines: 1093
  • Relevant Lines: 1981

💛 - Coveralls

@astefanutti astefanutti force-pushed the pr-28 branch 3 times, most recently from 44121d4 to d71c92c on September 29, 2025 08:13
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Sep 29, 2025
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@jaiakash
Member

jaiakash commented Oct 1, 2025

Hi @andreyvelich, can you please add the ok-to-test-gpu-runner label to this PR to verify that Podman doesn't break the GPU E2E test?
For the GPU E2E test, we are configuring the runtime for Docker here, and I wanted to confirm that part:

sudo nvidia-ctk runtime configure --runtime=docker

Member

@andreyvelich andreyvelich left a comment


@astefanutti You added support for Podman to allow users to run these E2Es locally, right?
IIUC, we still use Docker for our CI tests.

@astefanutti
Copy link
Contributor Author

> @astefanutti You added support for Podman to allow users to run these E2Es locally, right?
> IIUC, we still use Docker for our CI tests.

@andreyvelich you're right, it's meant for running the E2Es locally. That being said, @jaiakash rightly suggested running the GPU E2E tests to make sure there is no regression.

@astefanutti
Copy link
Contributor Author

@jaiakash should nvkind only be used to create the KinD cluster, with kind used afterwards, or can it "alias" kind entirely?

@jaiakash
Copy link
Member

jaiakash commented Oct 1, 2025

> @jaiakash should nvkind only be used to create the KinD cluster, with kind used afterwards, or can it "alias" kind entirely?

Yes, for creating the cluster, nvkind has to be used; the default kind binary won't work. Ref: https://github.com/NVIDIA/nvkind. Once the cluster is created, we can use the regular kind commands as usual, since it behaves like a standard KinD cluster after provisioning.
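
For readers unfamiliar with this split, a rough sketch of the resulting workflow (the cluster and image names are illustrative, and nvkind flags may differ between versions):

# nvkind provisions the cluster so that GPUs are wired into the nodes.
nvkind cluster create --name trainer-e2e

# From here on it is a normal KinD cluster, so the stock kind binary works.
kind get clusters
kind load docker-image trainer-test:latest --name trainer-e2e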

One potential issue I foresee is that we're currently configuring the NVIDIA container runtime for Docker:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

If we switch the container runtime to Podman, this setup might not work as expected. I was looking into this guide: Podman GPU Support. I will check it tomorrow morning.
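
For reference, Podman usually consumes GPUs through CDI rather than a daemon-level runtime, so there is no daemon to reconfigure or restart. A sketch under that assumption (exact steps depend on the NVIDIA Container Toolkit version):

# Generate a CDI specification describing the host's NVIDIA devices.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Smoke-test GPU visibility from a Podman container.
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi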

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@astefanutti
Copy link
Contributor Author

> Yes, for creating the cluster, nvkind has to be used; the default kind binary won't work. Ref: https://github.com/NVIDIA/nvkind. Once the cluster is created, we can use the regular kind commands as usual, since it behaves like a standard KinD cluster after provisioning.

@jaiakash thanks, I've updated it accordingly.

> One potential issue I foresee is that we're currently configuring the NVIDIA container runtime for Docker:
>
> sudo nvidia-ctk runtime configure --runtime=docker
> sudo systemctl restart docker
>
> If we switch the container runtime to Podman, this setup might not work as expected. I was looking into this guide: Podman GPU Support. I will check it tomorrow morning.

It would be nice if that worked, but we don't have to switch the GH Actions E2E workflows to use Podman.
It's really meant for contributors who use Podman to run the tests locally.

Member

@jaiakash jaiakash left a comment


Thanks for the edit, @astefanutti.

@jaiakash
Member

jaiakash commented Oct 2, 2025

My bad, it got assigned to me; I only wanted to LGTM the PR, not assign it to myself.

/assign @astefanutti

@astefanutti
Contributor Author

@jaiakash thanks.

/assign @kubeflow/kubeflow-trainer-team

Member

@tenzen-y tenzen-y left a comment


Thank you 👍
/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 8c36148 into kubeflow:master Oct 3, 2025
28 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.1 milestone Oct 3, 2025
@astefanutti astefanutti deleted the pr-28 branch October 3, 2025 13:56
