-
Notifications
You must be signed in to change notification settings - Fork 980
Synchronize trainer manifests 2.0 #3181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Synchronize trainer manifests 2.0 #3181
Conversation
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
tests/trainer_test.sh
Outdated
kubectl get pods -n $KF_PROFILE --show-labels | ||
kubectl wait --for=condition=Ready pod -l batch.kubernetes.io/job-name=pytorch-simple-node-0 -n $KF_PROFILE --timeout=180s | ||
|
||
kubectl wait --for=condition=Complete job/pytorch-simple-node-0 -n $KF_PROFILE --timeout=450s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@juliusvonkohout
this job is taking too much time to complete like around 25-30 minutes.
So should i change this condition to running or increase the timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would even like to lower the timeout if possible. 450 s is already a lot. I set only requests instead of limits now. that might help already. but if you can reduce it further, please do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes now all are green
Signed-off-by: Julius von Kohout <45896133+juliusvonkohout@users.noreply.github.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
kubectl apply --server-side --force-conflicts -f overlays/kubeflow-platform/kubeflow-trainer-roles.yaml | ||
kustomize build overlays/runtimes | kubectl apply --server-side --force-conflicts -f - | ||
|
||
kubectl get deployment -n kubeflow-system kubeflow-trainer-controller-manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please switch everything to kubeflow instead of kubeflow-system for now. We need an overlay in applications/trainer/overlays/... for that. I would wait until we have other components ready before opening up a new namespaces. Also for that namespace kubeflow-system we would have to set PSS restricted before we use it and introduce it to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- overlays/manager: Deploys to kubeflow-system namespace, Has a secretGenerator that creates the webhook certificate secret properly
- We are using this overlays/kubeflow-platform: Changes the overall namespace to kubeflow and removes kubeflow-system namespace, but the secretGenerator still references kubeflow-system.
- This creates a namespace mismatch for the webhook certificate which causing the test to fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could adjust the certificate. You can also go the other way around, so stick with kubeflow-system and add PSS to the kubeflow-system namespace and add default blocker networkpolicies for kubeflow-system and exceptions for the trainer webhook in kubeflow-system.
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
/retest |
@kunal-511: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
please fix the securitycontext here and upstream in the trainer repository. |
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
@juliusvonkohout I think this ready for the review |
/retest |
1 similar comment
/retest |
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
/retest |
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
Signed-off-by: kunal-511 <yoyokvunal@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets not have this under experimental, this is really a basic requirement now. It can be directly in the trainer folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and please create an upstream PR to fix this and ping @andreyvelich to merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets not have this under experimental, this is really a basic requirement now. It can be directly in the trainer folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and please create an upstream PR to fix this and ping @andreyvelich to merge it.
Pull Request Template for Kubeflow Manifests
✏️ Summary of Changes
📦 Dependencies
🐛 Related Issues
✅ Contributor Checklist