-
Notifications
You must be signed in to change notification settings - Fork 834
Add Helm chart for kubeflow trainer #2435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Helm chart for kubeflow trainer #2435
Conversation
charts/trainer/README.md
Outdated
| @@ -0,0 +1,105 @@ | |||
| # trainer | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this @ChenYi015!
Did you get a chance to explore Kueue script to generate Helm Charts from Kustomize manifests ?
I think, that should significantly help us to keep Kustomize and Charts in sync.
https://github.com/kubernetes-sigs/kueue/blob/main/hack/update-helm.sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take a look at the shell script, it is a bit complicated. I think it may be more easier to sync manifests templated by Helm charts to Kustomize manifests, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a shell script hack/sync-manifests. Now one can execute make sync-manifests to sync the Kustomize manifests from the manifests templated by the Helm chart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this effort @ChenYi015!
@tenzen-y @kannon92 @ahg-g @astefanutti @Electronic-Waste @kubeflow/wg-training-leads What do you think about it ?
Should we go other way around to sync Helm Charts to Kustomize Manifests ?
We can use the same approach for JobSet/Kueue if that is easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @kubeflow/wg-manifests-leads @kubeflow/release-team @varodrig to review script of sync Kustomize + Helm automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7dc781c to
c0b1d7d
Compare
|
@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? |
Not yet. We are still working on that change. Once we have a release it should push to the registry. |
a4b36db to
4cd2344
Compare
4cd2344 to
77c023d
Compare
|
/hold for waiting jobset helm chart to be published |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great work! Thank for working on this.
I left a couple of comments and recommendations.
77c023d to
097e217
Compare
|
/hold cancel for jobset chart v0.8.0 has been released. I have updated the PR, now one can test the Helm chart installation with the following comand: |
|
/unhold |
097e217 to
6620551
Compare
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Yi Chen <github@chenyicn.net>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChenYi015 I was testing your Helm Charts. A few questions:
- Can we remove the
kubeflow-trainer-prefix from the JobSet controller or this is a requirements that all resources created from this chart must start withkubeflow-trainer-?
k get pods -n $NS
NAME READY STATUS RESTARTS AGE
kubeflow-trainer-controller-manager-9d8bb45ff-gm82p 0/1 ContainerCreating 0 4s
kubeflow-trainer-jobset-controller-744494458d-6hmgw 0/1 Running 1 (3s ago) 4s
-
The ClusterTrainingRuntime are not installed as part of the Charts.
-
For JobSet why do we create RoleBinding and ClusterRoleBinding ?
k get rolebinding -n kubeflow-system
NAME ROLE AGE
kubeflow-trainer-jobset-controller Role/kubeflow-trainer-jobset-controller 5m23s
k get clusterrolebinding -n kubeflow-system | grep jobset
kubeflow-trainer-jobset-controller ClusterRole/kubeflow-trainer-jobset-controller 5m26s
cc @kannon92 @kubeflow/wg-training-leads @Electronic-Waste @astefanutti
charts/kubeflow-trainer/values.yaml
Outdated
| # -- Image tag. | ||
| # @default -- If not set, the chart version will be used. | ||
| tag: "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, the default installation fails.
| # -- Image tag. | |
| # @default -- If not set, the chart version will be used. | |
| tag: "" | |
| # -- Image tag. | |
| tag: latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it will work as expected if we have released v2.0.0 and push associated image kubeflow/trainer/trainer-controller-manager:2.0.0 to GHCR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically, we try to keep latest tag in the master branch and release tag in the release-xx branch.
Like in Katib:
- master branch: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L23
- release branch: https://github.com/kubeflow/katib/blob/release-0.18/manifests/v1beta1/installs/katib-standalone/kustomization.yaml#L23
Shall we do the same for Helm Charts @ChenYi015 @Electronic-Waste @astefanutti @kubeflow/wg-training-leads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have updated to use latest as image tag.
I thinks this is the default behavior of installing jobset as a sub chart of trainer. I will try to find a way to remove the prefix.
The ClusterRole/ClusterRoleBinding comes from role.yaml and role_binding.yaml and are used to grant cluster-scoped permissions. The Role/RoleBinding comes from leader_election_role.yaml and leader_election_role_binding.yaml and are used to grant namespace-scoped permissions (e.g. leader election). |
Oh, I see. Maybe we should rename the postfix for this role as: cc @kannon92 |
We may have problems when do helm upgrading, when the webhook server are restarted and not ready to validate/mutate ClusterTrainingRuntime, it will fail the upgrading. I will try to find a better way to manage ClusterTrainingRuntime in the follow-up PRs. |
Signed-off-by: Yi Chen <github@chenyicn.net>
Signed-off-by: Yi Chen <github@chenyicn.net>
|
Thank you for the update! |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@andreyvelich Thank you for your review and comments. Let us improve the Helm chart step by step. /unhold |
* Add Helm chart for kubeflow trainer Signed-off-by: Yi Chen <github@chenyicn.net> * Remove Role and RoleBinding resources Signed-off-by: Yi Chen <github@chenyicn.net> * Update RBAC rules for ConfigMap and Secret Signed-off-by: Yi Chen <github@chenyicn.net> * Remove RBAC rules for LimitRange Signed-off-by: Yi Chen <github@chenyicn.net> * Remove RBAC rules for RuntimeClass Signed-off-by: Yi Chen <github@chenyicn.net> * Update the year for header license comment Signed-off-by: Yi Chen <github@chenyicn.net> * Update header license comments Signed-off-by: Yi Chen <github@chenyicn.net> * Update Helm unit tests for manager deployment and service Signed-off-by: Yi Chen <github@chenyicn.net> * Add more Helm chart unit tests Signed-off-by: Yi Chen <github@chenyicn.net> * Remove manager role and roleBinding template helpers Signed-off-by: Yi Chen <github@chenyicn.net> * Remove trailing whitespace Signed-off-by: Yi Chen <github@chenyicn.net> * Update charts/kubeflow-trainer/values.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Yi Chen <github@chenyicn.net> * set image.tag to latest Signed-off-by: Yi Chen <github@chenyicn.net> * Update chart README Signed-off-by: Yi Chen <github@chenyicn.net> --------- Signed-off-by: Yi Chen <github@chenyicn.net> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
What this PR does / why we need it:
Close #1197
Checklist: