Skip to content

Conversation

@ChenYi015
Copy link
Member

@ChenYi015 ChenYi015 commented Feb 13, 2025

What this PR does / why we need it:

Close #1197

Checklist:

  • Docs included if any changes are user facing

@@ -0,0 +1,105 @@
# trainer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing this @ChenYi015!
Did you get a chance to explore Kueue script to generate Helm Charts from Kustomize manifests ?
I think, that should significantly help us to keep Kustomize and Charts in sync.

https://github.com/kubernetes-sigs/kueue/blob/main/hack/update-helm.sh

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look at the shell script, it is a bit complicated. I think it may be more easier to sync manifests templated by Helm charts to Kustomize manifests, WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a shell script hack/sync-manifests. Now one can execute make sync-manifests to sync the Kustomize manifests from the manifests templated by the Helm chart.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this effort @ChenYi015!
@tenzen-y @kannon92 @ahg-g @astefanutti @Electronic-Waste @kubeflow/wg-training-leads What do you think about it ?
Should we go other way around to sync Helm Charts to Kustomize Manifests ?
We can use the same approach for JobSet/Kueue if that is easier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @kubeflow/wg-manifests-leads @kubeflow/release-team @varodrig to review script of sync Kustomize + Helm automatically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChenYi015 ChenYi015 marked this pull request as ready for review February 14, 2025 09:04
@google-oss-prow google-oss-prow bot requested a review from jinchihe February 14, 2025 09:04
@ChenYi015 ChenYi015 marked this pull request as draft February 14, 2025 09:05
@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch 2 times, most recently from 7dc781c to c0b1d7d Compare February 14, 2025 10:54
@andreyvelich
Copy link
Member

@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR?
@kannon92 did we push charts to the OCI registry after this PR: kubernetes-sigs/jobset#792 ?

@kannon92
Copy link
Contributor

@ChenYi015 Given that we created Helm Charts for JobSet, are we ready to finish this PR? @kannon92 did we push charts to the OCI registry after this PR: kubernetes-sigs/jobset#792 ?

Not yet. We are still working on that change. Once we have a release it should push to the registry.

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from a4b36db to 4cd2344 Compare February 25, 2025 03:42
@ChenYi015 ChenYi015 marked this pull request as ready for review February 25, 2025 03:48
@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from 4cd2344 to 77c023d Compare February 26, 2025 07:26
@ChenYi015
Copy link
Member Author

/hold for waiting jobset helm chart to be published

Copy link
Contributor

@varodrig varodrig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work! Thank for working on this.

I left a couple of comments and recommendations.

@ChenYi015 ChenYi015 force-pushed the feature/helm-charts-v2 branch from 77c023d to 097e217 Compare March 3, 2025 03:16
@ChenYi015
Copy link
Member Author

/hold cancel for jobset chart v0.8.0 has been released. I have updated the PR, now one can test the Helm chart installation with the following comand:

helm upgrade kubeflow-trainer charts/kubeflow-trainer \
      --install \
      --dependency-update \
      --namespace kubeflow-system \
      --create-namespace \
      --wait \
      --timeout 5m0s \
      --set jobset.install=true

@ChenYi015
Copy link
Member Author

/unhold

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Yi Chen <github@chenyicn.net>
@google-oss-prow google-oss-prow bot removed the lgtm label Mar 29, 2025
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@google-oss-prow
Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: saileshd1402.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/lgtm
/cc @tenzen-y @johnugeorge @saileshd1402 @Electronic-Waste @juliusvonkohout @deepanker13 @franciscojavierarceo @thesuperzapper @astefanutti

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChenYi015 I was testing your Helm Charts. A few questions:

  1. Can we remove the kubeflow-trainer- prefix from the JobSet controller or this is a requirements that all resources created from this chart must start with kubeflow-trainer-?
k get pods -n $NS
NAME                                                  READY   STATUS              RESTARTS     AGE
kubeflow-trainer-controller-manager-9d8bb45ff-gm82p   0/1     ContainerCreating   0            4s
kubeflow-trainer-jobset-controller-744494458d-6hmgw   0/1     Running             1 (3s ago)   4s
  1. The ClusterTrainingRuntime are not installed as part of the Charts.

  2. For JobSet why do we create RoleBinding and ClusterRoleBinding ?

k get rolebinding -n kubeflow-system        
NAME                                 ROLE                                      AGE
kubeflow-trainer-jobset-controller   Role/kubeflow-trainer-jobset-controller   5m23s

k get clusterrolebinding -n kubeflow-system | grep jobset
kubeflow-trainer-jobset-controller                              ClusterRole/kubeflow-trainer-jobset-controller                                     5m26s

cc @kannon92 @kubeflow/wg-training-leads @Electronic-Waste @astefanutti

Comment on lines 40 to 42
# -- Image tag.
# @default -- If not set, the chart version will be used.
tag: ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, the default installation fails.

Suggested change
# -- Image tag.
# @default -- If not set, the chart version will be used.
tag: ""
# -- Image tag.
tag: latest

Copy link
Member Author

@ChenYi015 ChenYi015 Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it will work as expected if we have released v2.0.0 and push associated image kubeflow/trainer/trainer-controller-manager:2.0.0 to GHCR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically, we try to keep latest tag in the master branch and release tag in the release-xx branch.
Like in Katib:

Shall we do the same for Helm Charts @ChenYi015 @Electronic-Waste @astefanutti @kubeflow/wg-training-leads?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated to use latest as image tag.

@ChenYi015
Copy link
Member Author

Can we remove the kubeflow-trainer- prefix from the JobSet controller or this is a requirements that all resources created from this chart must start with kubeflow-trainer-?

I thinks this is the default behavior of installing jobset as a sub chart of trainer. I will try to find a way to remove the prefix.

For JobSet why do we create RoleBinding and ClusterRoleBinding ?

The ClusterRole/ClusterRoleBinding comes from role.yaml and role_binding.yaml and are used to grant cluster-scoped permissions.

The Role/RoleBinding comes from leader_election_role.yaml and leader_election_role_binding.yaml and are used to grant namespace-scoped permissions (e.g. leader election).

@andreyvelich
Copy link
Member

The Role/RoleBinding comes from leader_election_role.yaml and leader_election_role_binding.yaml and are used to grant namespace-scoped permissions (e.g. leader election).

Oh, I see. Maybe we should rename the postfix for this role as: -leader-election. To be consistent with manifests: https://github.com/kubernetes-sigs/jobset/blob/main/config/components/rbac/role_binding.yaml

cc @kannon92

@ChenYi015
Copy link
Member Author

The ClusterTrainingRuntime are not installed as part of the Charts.

We may have problems when do helm upgrading, when the webhook server are restarted and not ready to validate/mutate ClusterTrainingRuntime, it will fail the upgrading. I will try to find a better way to manage ClusterTrainingRuntime in the follow-up PRs.

Signed-off-by: Yi Chen <github@chenyicn.net>
Signed-off-by: Yi Chen <github@chenyicn.net>
@google-oss-prow google-oss-prow bot removed the lgtm label Apr 10, 2025
@andreyvelich
Copy link
Member

Thank you for the update!
Feel free to un-hold the PR if you want to address the JobSet naming changes in the followup PRs.
/lgtm
/approve
/hold

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChenYi015
Copy link
Member Author

Feel free to un-hold the PR if you want to address the JobSet naming changes in the followup PRs.

@andreyvelich Thank you for your review and comments. Let us improve the Helm chart step by step.

/unhold

@google-oss-prow google-oss-prow bot merged commit 541f027 into kubeflow:master Apr 11, 2025
18 checks passed
@ChenYi015 ChenYi015 deleted the feature/helm-charts-v2 branch April 11, 2025 02:46
akagami-harsh pushed a commit to akagami-harsh/training-operator that referenced this pull request Jul 17, 2025
* Add Helm chart for kubeflow trainer

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove Role and RoleBinding resources

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update RBAC rules for ConfigMap and Secret

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove RBAC rules for LimitRange

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove RBAC rules for RuntimeClass

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update the year for header license comment

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update header license comments

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update Helm unit tests for manager deployment and service

Signed-off-by: Yi Chen <github@chenyicn.net>

* Add more Helm chart unit tests

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove manager role and roleBinding template helpers

Signed-off-by: Yi Chen <github@chenyicn.net>

* Remove trailing whitespace

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update charts/kubeflow-trainer/values.yaml

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Yi Chen <github@chenyicn.net>

* set image.tag to latest

Signed-off-by: Yi Chen <github@chenyicn.net>

* Update chart README

Signed-off-by: Yi Chen <github@chenyicn.net>

---------

Signed-off-by: Yi Chen <github@chenyicn.net>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Helm Charts for Kubeflow Trainer V2

8 participants