Skip to content

Synchronize trainer manifests 2.0 #3181

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 48 commits into
base: master
Choose a base branch
from

Conversation

kunal-511
Copy link
Contributor

Pull Request Template for Kubeflow Manifests

✏️ Summary of Changes

Describe the changes you have made, including any refactoring or feature additions.

📦 Dependencies

List any dependencies or related PRs (e.g., "Depends on #123").

🐛 Related Issues

Link any issues that are resolved or affected by this PR.

✅ Contributor Checklist

  • I have tested these changes with kustomize. See Installation Prerequisites.
  • All commits are signed-off to satisfy the DCO check.
  • I have considered adding my company to the adopters page to support Kubeflow and help the community, since I expect help from the community for my issue (see 1. and 2.).

You can join the CNCF Slack and access our meetings at the Kubeflow Community website. Our channel on the CNCF Slack is here #kubeflow-platform.

kunal-511 added 2 commits July 1, 2025 23:39
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kimwnasptd for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kunal-511 added 5 commits July 2, 2025 00:10
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
kubectl get pods -n $KF_PROFILE --show-labels
kubectl wait --for=condition=Ready pod -l batch.kubernetes.io/job-name=pytorch-simple-node-0 -n $KF_PROFILE --timeout=180s

kubectl wait --for=condition=Complete job/pytorch-simple-node-0 -n $KF_PROFILE --timeout=450s
Copy link
Contributor Author

@kunal-511 kunal-511 Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliusvonkohout
this job is taking too much time to complete like around 25-30 minutes.
So should i change this condition to running or increase the timeout?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would even like to lower the timeout if possible. 450 s is already a lot. I set only requests instead of limits now. that might help already. but if you can reduce it further, please do so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes now all are green

juliusvonkohout and others added 2 commits July 2, 2025 11:43
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
kubectl apply --server-side --force-conflicts -f overlays/kubeflow-platform/kubeflow-trainer-roles.yaml
kustomize build overlays/runtimes | kubectl apply --server-side --force-conflicts -f -

kubectl get deployment -n kubeflow-system kubeflow-trainer-controller-manager
Copy link
Member

@juliusvonkohout juliusvonkohout Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please switch everything to kubeflow instead of kubeflow-system for now. We need an overlay in applications/trainer/overlays/... for that. I would wait until we have other components ready before opening up a new namespaces. Also for that namespace kubeflow-system we would have to set PSS restricted before we use it and introduce it to users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • overlays/manager: Deploys to kubeflow-system namespace, Has a secretGenerator that creates the webhook certificate secret properly
  • We are using this overlays/kubeflow-platform: Changes the overall namespace to kubeflow and removes kubeflow-system namespace, but the secretGenerator still references kubeflow-system.
  • This creates a namespace mismatch for the webhook certificate which causing the test to fail

Copy link
Member

@juliusvonkohout juliusvonkohout Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could adjust the certificate. You can also go the other way around, so stick with kubeflow-system and add PSS to the kubeflow-system namespace and add default blocker networkpolicies for kubeflow-system and exceptions for the trainer webhook in kubeflow-system.

kunal-511 added 2 commits July 3, 2025 14:43
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
@kunal-511
Copy link
Contributor Author

/retest

Copy link

@kunal-511: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@juliusvonkohout
Copy link
Member

/ok-to-test

Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
@kunal-511 kunal-511 marked this pull request as draft July 22, 2025 12:00
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
@kunal-511 kunal-511 marked this pull request as ready for review July 22, 2025 12:58
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
@juliusvonkohout
Copy link
Member

please fix the securitycontext here and upstream in the trainer repository.

Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
@kunal-511
Copy link
Contributor Author

@juliusvonkohout I think this ready for the review

@kunal-511
Copy link
Contributor Author

/retest

1 similar comment
@kunal-511
Copy link
Contributor Author

/retest

Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
@kunal-511
Copy link
Contributor Author

/retest

Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets not have this under experimental, this is really a basic requirement now. It can be directly in the trainer folder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and please create an upstream PR to fix this and ping @andreyvelich to merge it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets not have this under experimental, this is really a basic requirement now. It can be directly in the trainer folder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and please create an upstream PR to fix this and ping @andreyvelich to merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants