
Conversation

@Electronic-Waste (Member):

What this PR does / why we need it:

Currently, we override the node container if the target replicatedJob has the trainer.kubeflow.org/trainjob-ancestor-step: trainer label:

// Update values for the Trainer container.
for j, container := range rJob.Template.Spec.Template.Spec.Containers {
	if *container.Name == constants.Node {
		// Update values from the TrainJob trainer.
		if jobTrainer := trainJob.Spec.Trainer; jobTrainer != nil {
			if image := jobTrainer.Image; image != nil {
				b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Image = image
			}
			if command := jobTrainer.Command; command != nil {
				b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Command = command
			}
			if args := jobTrainer.Args; args != nil {
				b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Args = args
			}
			if resourcesPerNode := jobTrainer.ResourcesPerNode; resourcesPerNode != nil &&
				(resourcesPerNode.Limits != nil || resourcesPerNode.Requests != nil) {
				requirements := corev1ac.ResourceRequirements()
				if limits := resourcesPerNode.Limits; limits != nil {
					requirements.WithLimits(limits)
				}
				if requests := resourcesPerNode.Requests; requests != nil {
					requirements.WithRequests(requests)
				}
				b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].
					WithResources(requirements)
			}
			apply.UpsertEnvVars(
				&b.Spec.ReplicatedJobs[i].Template.Spec.Template.Spec.Containers[j].Env,
				apply.EnvVars(jobTrainer.Env...)...,
			)
		}
	}
}

This design works well for the official runtimes provided by Kubeflow Trainer. However, users writing custom runtimes may find that they fail to override the configuration in the target replicatedJob if they give the node container a different name (such as train or trainer). This is tricky, error-prone, and hard to diagnose.

To avoid this scenario, I add a validation condition to the (Cluster)TrainingRuntime webhook, and also provide some unit tests for it.
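
For illustration, here is a minimal sketch of what such a check could look like, simplified to the trainer step only. The label key comes from the description above; the helper and constant names (validateReplicatedJobs, ancestorLabel, nodeContainer) are hypothetical and may differ from the actual change:

	// Hypothetical sketch of the webhook validation, simplified to the
	// "trainer" ancestor step. Helper and constant names are illustrative.
	package webhooks

	import (
		"fmt"

		"k8s.io/apimachinery/pkg/util/validation/field"
		jobsetv1alpha2 "sigs.k8s.io/jobset/api/jobset/v1alpha2"
	)

	const (
		ancestorLabel = "trainer.kubeflow.org/trainjob-ancestor-step" // label from the PR description
		nodeContainer = "node"                                        // i.e. constants.Node
	)

	func validateReplicatedJobs(rJobs []jobsetv1alpha2.ReplicatedJob) field.ErrorList {
		var allErrs field.ErrorList
		for i, rJob := range rJobs {
			// Only replicatedJobs labeled as the trainer step must carry the node container.
			if rJob.Template.Labels[ancestorLabel] != "trainer" {
				continue
			}
			hasRequiredContainer := false
			for _, container := range rJob.Template.Spec.Template.Spec.Containers {
				if container.Name == nodeContainer {
					hasRequiredContainer = true
					break
				}
			}
			if !hasRequiredContainer {
				allErrs = append(allErrs, field.Required(
					field.NewPath("spec", "template", "spec", "replicatedJobs").Index(i).Child("template"),
					fmt.Sprintf("a container named %q is required for the trainer step", nodeContainer)))
			}
		}
		return allErrs
	}

With a check like this in place, a custom runtime that names the container train or trainer is rejected at admission time instead of silently keeping its original image, command, and args.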

/cc @kubeflow/wg-training-leads @astefanutti @rudeigerc

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@google-oss-prow google-oss-prow bot requested review from a team and astefanutti July 8, 2025 09:34
@google-oss-prow (bot):

@Electronic-Waste: GitHub didn't allow me to request PR reviews from the following users: rudeigerc.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

[quotes the PR description above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@coveralls (bot) commented Jul 8, 2025:

Pull Request Test Coverage Report for Build 18613219957

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 16 of 16 (100.0%) changed or added relevant lines in 1 file are covered.
  • 202 unchanged lines in 13 files lost coverage.
  • Overall coverage increased (+22.5%) to 52.344%

Files with Coverage Reduction                    New Missed Lines    %
pkg/apply/apply.go                               1                   91.3%
pkg/runtime/framework/plugins/registry.go        2                   0.0%
pkg/webhooks/trainjob_webhook.go                 2                   30.3%
pkg/controller/trainjob_controller.go            7                   0.0%
pkg/runtime/framework/plugins/jobset/builder.go  7                   0.0%
pkg/runtime/runtime.go                           7                   72.12%
pkg/webhooks/clustertrainingruntime_webhook.go   12                  46.43%
pkg/runtime/framework/core/framework.go          15                  84.69%
pkg/runtime/framework/plugins/torch/torch.go     22                  84.18%
pkg/runtime/core/clustertrainingruntime.go       24                  33.33%
Totals Coverage Status
Change from base Build 16133828859: 22.5%
Covered Lines: 1273
Relevant Lines: 2432

💛 - Coveralls

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@Electronic-Waste (Member, Author):

/retest

1 similar comment
@Electronic-Waste (Member, Author):

/retest

@andreyvelich (Member) left a comment:

Sorry I missed this PR @Electronic-Waste!
The validation looks good to me.
/assign @tenzen-y @astefanutti

// 2. model-initializer - model-initializer
// 3. trainer - node
hasRequiredContainer := false
for _, container := range rJob.Template.Spec.Template.Spec.Containers {
Contributor:

I wonder how difficult that would be to implement this validation in CEL?

@Electronic-Waste (Member, Author) commented Oct 18, 2025:

Hmm. These fields are defined in JobSetSpec:

// spec of the desired JobSet which will be created from TrainJob.
// +optional
Spec jobsetv1alpha2.JobSetSpec `json:"spec,omitempty"`

I'm not sure whether we can reference the container name in it (detecting the corresponding label) and verify its value. By contrast, I think the webhook is more convenient and does not require changing the API by adding something like:

// +kubebuilder:validation:XValidation:rule="self > 0 || self in ['auto', 'cpu', 'gpu']", message="NumProcPerNode must be equal to auto, cpu, gpu, or int value"
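
For comparison, a CEL rule for this particular check would presumably need to live on that Spec field and look roughly like the untested sketch below (the rule text is hypothetical, and CEL access to embedded template metadata in CRDs is limited, which is part of why the webhook route is simpler):

	// Hypothetical, untested CEL sketch (not part of this PR): every replicatedJob
	// labeled as the trainer step must have a container named "node".
	// +kubebuilder:validation:XValidation:rule="!has(self.replicatedJobs) || self.replicatedJobs.all(r, !('trainer.kubeflow.org/trainjob-ancestor-step' in r.template.metadata.labels) || r.template.metadata.labels['trainer.kubeflow.org/trainjob-ancestor-step'] != 'trainer' || r.template.spec.template.spec.containers.exists(c, c.name == 'node'))", message="trainer step must have a container named node"
	Spec jobsetv1alpha2.JobSetSpec `json:"spec,omitempty"`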

Signed-off-by: Electronic-Waste <2690692950@qq.com>
@Electronic-Waste Electronic-Waste changed the title feat(webhook): Add validation for required containers in replicatedJobs feat(operator): Add validation for required containers in replicatedJobs Oct 18, 2025
@Electronic-Waste (Member, Author):

@andreyvelich @astefanutti Sorry for the late response. I've addressed all of your comments. Please take a look.

@Electronic-Waste (Member, Author):

@andreyvelich Maybe we can add this to milestone v2.1? I think it should be ready now:)

/milestone 2.1

@google-oss-prow (bot):

@Electronic-Waste: The provided milestone is not valid for this repository. Milestones in this repository: [v2.0, v2.1]

Use /milestone clear to clear the milestone.

In response to this:

[quotes the comment above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Electronic-Waste (Member, Author):

/milestone v2.1

@google-oss-prow google-oss-prow bot added this to the v2.1 milestone Oct 20, 2025
@andreyvelich (Member) left a comment:

Sounds good, thanks @Electronic-Waste!
/lgtm
/assign @astefanutti

@astefanutti (Contributor):

Thanks @Electronic-Waste!

/lgtm
/approve

@google-oss-prow (bot):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit b02eb5b into kubeflow:master Oct 20, 2025
33 of 34 checks passed