
Conversation

astefanutti
Contributor

What this PR does / why we need it:

Support setting up the e2e cluster with Podman in addition to Docker.
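
For context, kind already supports Podman through the KIND_EXPERIMENTAL_PROVIDER environment variable, so engine detection can stay small. A minimal sketch of that idea, assuming a simple fallback order (the CONTAINER_ENGINE variable and the cluster name are illustrative, not necessarily what this PR uses):

# Prefer Docker if it is installed, otherwise fall back to Podman.
if command -v docker >/dev/null 2>&1; then
  CONTAINER_ENGINE=docker
elif command -v podman >/dev/null 2>&1; then
  CONTAINER_ENGINE=podman
  # kind selects Podman via this (experimental) provider switch.
  export KIND_EXPERIMENTAL_PROVIDER=podman
else
  echo "neither docker nor podman found" >&2
  exit 1
fi

kind create cluster --name trainer-e2e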

Checklist:

  • Docs included if any changes are user facing

@coveralls

coveralls commented Sep 29, 2025

Pull Request Test Coverage Report for Build 18186567965

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.174%

Totals Coverage Status

  • Change from base Build 18045194431: 0.0%
  • Covered Lines: 1093
  • Relevant Lines: 1981

💛 - Coveralls

@astefanutti astefanutti force-pushed the pr-28 branch 3 times, most recently from 44121d4 to d71c92c on September 29, 2025 08:13
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Sep 29, 2025
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@jaiakash
Member

jaiakash commented Oct 1, 2025

Hi @andreyvelich, can you please add the ok-to-test-gpu-runner label to this PR to verify that Podman doesn't break the GPU E2E test?
For the GPU E2E test, we are configuring the runtime for Docker here, and I wanted to confirm that part:

sudo nvidia-ctk runtime configure --runtime=docker

Member

@andreyvelich andreyvelich left a comment


@astefanutti You added support for Podman to allow users to run these E2Es locally, right?
IIUC, we still use Docker for our CI tests.

@astefanutti
Copy link
Contributor Author

> @astefanutti You added support for Podman to allow users to run these E2Es locally, right?
> IIUC, we still use Docker for our CI tests.

@andreyvelich you're right, it's meant for running the E2Es locally. That being said, @jaiakash rightly suggested running the GPU E2E tests to make sure there is no regression.

@astefanutti
Copy link
Contributor Author

@jaiakash should nvkind only be used to create the KinD cluster, with kind used afterwards, or can it "alias" kind entirely?

@jaiakash
Copy link
Member

jaiakash commented Oct 1, 2025

> @jaiakash should nvkind only be used to create the KinD cluster, with kind used afterwards, or can it "alias" kind entirely?

Yes, for creating the cluster, nvkind has to be used; the default kind binary won't work. Ref: https://github.com/NVIDIA/nvkind. Once the cluster is created, we can use the regular kind commands as usual, since it behaves like a standard KinD cluster after provisioning.
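
For readers unfamiliar with this split, a rough sketch of the resulting workflow (the cluster and image names are illustrative, and nvkind flags may differ between versions):

# nvkind provisions the cluster so that GPUs are wired into the nodes.
nvkind cluster create --name trainer-e2e

# From here on it is a normal KinD cluster, so the stock kind binary works.
kind get clusters
kind load docker-image trainer-test:latest --name trainer-e2e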

One potential issue I foresee is that we're currently configuring the NVIDIA container runtime for Docker:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

If we switch the container runtime to Podman, this setup might not work as expected. I was looking into this guide: Podman GPU Support. I will check it tomorrow morning.
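
For reference, Podman usually consumes GPUs through CDI rather than a daemon-level runtime, so there is no daemon to reconfigure or restart. A sketch under that assumption (exact steps depend on the NVIDIA Container Toolkit version):

# Generate a CDI specification describing the host's NVIDIA devices.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Smoke-test GPU visibility from a Podman container.
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi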

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@astefanutti
Copy link
Contributor Author

> Yes, for creating the cluster, nvkind has to be used; the default kind binary won't work. Ref: https://github.com/NVIDIA/nvkind. Once the cluster is created, we can use the regular kind commands as usual, since it behaves like a standard KinD cluster after provisioning.

@jaiakash thanks, I've updated it accordingly.

> One potential issue I foresee is that we're currently configuring the NVIDIA container runtime for Docker:
>
> sudo nvidia-ctk runtime configure --runtime=docker
> sudo systemctl restart docker
>
> If we switch the container runtime to Podman, this setup might not work as expected. I was looking into this guide: Podman GPU Support. I will check it tomorrow morning.

It would be nice if that worked, but we don't have to switch the GH Actions E2E workflows to use Podman.
It's really meant for contributors who use Podman to run the tests locally.

Member

@jaiakash jaiakash left a comment


Thanks for the edit, @astefanutti.

@jaiakash
Member

jaiakash commented Oct 2, 2025

My bad, it got assigned to me; I only wanted to LGTM the PR, not assign it to myself.

/assign @astefanutti

@astefanutti
Contributor Author

@jaiakash thanks.

/assign @kubeflow/kubeflow-trainer-team

Member

@tenzen-y tenzen-y left a comment


Thank you 👍
/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 8c36148 into kubeflow:master Oct 3, 2025
28 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.1 milestone Oct 3, 2025
@astefanutti astefanutti deleted the pr-28 branch October 3, 2025 13:56
