-
Notifications
You must be signed in to change notification settings - Fork 834
feat(operator): Add validation for required containers in replicatedJobs #2722
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(operator): Add validation for required containers in replicatedJobs #2722
Conversation
Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Electronic-Waste <2690692950@qq.com>
@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc. Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Pull Request Test Coverage Report for Build 18613219957Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
Signed-off-by: Electronic-Waste <2690692950@qq.com>
/retest |
1 similar comment
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this PR @Electronic-Waste!
The validation looks good to me.
/assign @tenzen-y @astefanutti
// 2. model-initializer - model-initializer | ||
// 3. trainer - node | ||
hasRequiredContainer := false | ||
for _, container := range rJob.Template.Spec.Template.Spec.Containers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how difficult that would be to implement this validation in CEL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. These fields are defined in JobSetSpec:
trainer/pkg/apis/trainer/v1alpha1/trainingruntime_types.go
Lines 132 to 134 in 467322a
// spec of the desired JobSet which will be created from TrainJob. | |
// +optional | |
Spec jobsetv1alpha2.JobSetSpec `json:"spec,omitempty"` |
I'm not sure whether can we reference the container name in it (detecting corresponding label) and verify its value. In the contrast, I think webhook is more convenient and do not need to change the API by adding something like:
// +kubebuilder:validation:XValidation:rule="self > 0 || self in ['auto', 'cpu', 'gpu']", message="NumProcPerNode must be equal to auto, cpu, gpu, or int value" |
Signed-off-by: Electronic-Waste <2690692950@qq.com>
@andreyvelich @astefanutti Sorry for the late response. I've addressed all of your comments. Please take a look. |
@andreyvelich Maybe we can add this to milestone v2.1? I think it should be ready now:) /milestone 2.1 |
@Electronic-Waste: The provided milestone is not valid for this repository. Milestones in this repository: [ Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/milestone v2.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks @Electronic-Waste!
/lgtm
/assign @astefanutti
Thanks @Electronic-Waste! /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: astefanutti The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Currently, we'll override the
node
container if the target replicatedJob hastrainer.kubeflow.org/trainjob-ancestor-step: trainer
label:trainer/pkg/runtime/framework/plugins/jobset/builder.go
Lines 119 to 151 in ed5f859
This design works well in official runtimes provided by Kubeflow Trainer. However, when it comes to custom runtimes written by users, they may find that they failed to override the configs in the target replicatedJob using if they assign different name to
node
container (liketrain
,trainer
). This is tricky, error prone, and hard to figure out.In order to avoid this scenario, I add an validation condition in the (Cluster)TrainingRuntime's webhook, and also provide som e UTs for it.
/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: