-
Notifications
You must be signed in to change notification settings - Fork 834
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus #2802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
223f1b5
783630d
bfe4234
a1287d3
efc5e9d
ab46fa6
fc55beb
7d34b6e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -32,7 +32,10 @@ import ( | |||||||||||||||
| index "github.com/kubeflow/trainer/v2/pkg/runtime/indexer" | ||||||||||||||||
| ) | ||||||||||||||||
|
|
||||||||||||||||
| var errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered") | ||||||||||||||||
| var ( | ||||||||||||||||
| errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered") | ||||||||||||||||
| errorTooManyJobsStatusPlugin = errors.New("too many JobsStatus plugins are registered") | ||||||||||||||||
|
||||||||||||||||
| f := &Framework{ | |
| registry: r, | |
| } |
So this condition will never be executed:
trainer/pkg/runtime/framework/core/framework.go
Lines 82 to 85 in 848a4bd
| if p, ok := plugin.(framework.TerminalConditionPlugin); ok { | |
| if f.terminalConditionPlugin != nil { | |
| return nil, errorTooManyTerminalConditionPlugin | |
| } |
Am I missing something ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right and that's where the errors are returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefanutti Can you explain the flow when f.terminalConditionPlugin != nil when we execute framework's New() function ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I am wondering if there is any need to distinguish plugins that we define in the Registry and plugins that are part of those "registered" plugins (e.g. TerminalCondition plugin is part of JobSet plugin).
I could not catch the reason why you want to do that.
I think the current @astefanutti 's implementations are a straightforward way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I was asking what does "Plugin" mean in the Extension Framework from your point of view ?
And how are we going to allow users to extend the framework with their own plugins ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. The trainer provides the 2 level abstraction layers which is Runtime Framework and Plugin.
But, sometimes, the plugin is only for a specific Runtime Framework like the JobSet plugin.
I guess that we might want to have mechanisms to represent the tie between the Runtime Framework and the Plugin, but it would be better to avoid adding any validation relationship to allow them to reuse the implemented Plugin wherever the Runtime Framework.
Anyway, we can consider that as an another enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we can remove the verification of the number of plugins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed here, users won't be able to register more than one plugin for terminal condition: #2802 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the check is now moved at initialization, rather than at (first) execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. The new way makes more sense.
Thank you!
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -65,3 +65,8 @@ type TerminalConditionPlugin interface { | |
| Plugin | ||
| TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | ||
| } | ||
|
|
||
| type JobsStatusPlugin interface { | ||
| Plugin | ||
| JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) | ||
| } | ||
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -63,6 +63,7 @@ var _ framework.WatchExtensionPlugin = (*JobSet)(nil) | |
| var _ framework.PodNetworkPlugin = (*JobSet)(nil) | ||
| var _ framework.ComponentBuilderPlugin = (*JobSet)(nil) | ||
| var _ framework.TerminalConditionPlugin = (*JobSet)(nil) | ||
| var _ framework.JobsStatusPlugin = (*JobSet)(nil) | ||
| var _ framework.CustomValidationPlugin = (*JobSet)(nil) | ||
|
|
||
| const Name = constants.JobSetKind | ||
|
|
@@ -304,3 +305,22 @@ func (j *JobSet) TerminalCondition(ctx context.Context, trainJob *trainer.TrainJ | |
| } | ||
| return nil, nil | ||
| } | ||
|
|
||
| func (j *JobSet) JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) { | ||
| jobSet := &jobsetv1alpha2.JobSet{} | ||
| if err := j.client.Get(ctx, client.ObjectKeyFromObject(trainJob), jobSet); err != nil { | ||
| return nil, err | ||
| } | ||
| var statuses []trainer.JobStatus | ||
| for _, status := range jobSet.Status.ReplicatedJobsStatus { | ||
| statuses = append(statuses, trainer.JobStatus{ | ||
| Name: status.Name, | ||
| Ready: status.Ready, | ||
| Succeeded: status.Succeeded, | ||
| Failed: status.Failed, | ||
| Active: status.Active, | ||
| Suspended: status.Suspended, | ||
|
Comment on lines
+311
to
+316
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am wondering whether we will introduce a breaking change after this issue: kubernetes-sigs/jobset#723 ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree this would be better to be consistent across projects, thought I'd suggest not to hold this PR as it'll be a breaking change anyway and have it addressed separately so we can clearly "release-note" it as a breaking change. WDYT? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As we discussed offline in Slack, we will use Succeeded condition for now. |
||
| }) | ||
| } | ||
| return statuses, nil | ||
| } | ||
| Original file line number | Diff line number | Diff line change | ||||||
|---|---|---|---|---|---|---|---|---|
|
|
@@ -39,6 +39,7 @@ type Runtime interface { | |||||||
| NewObjects(ctx context.Context, trainJob *trainer.TrainJob) ([]any, error) | ||||||||
| RuntimeInfo(trainJob *trainer.TrainJob, runtimeTemplateSpec any, mlPolicy *trainer.MLPolicy, podGroupPolicy *trainer.PodGroupPolicy) (*Info, error) | ||||||||
| TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | ||||||||
| JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) | ||||||||
|
||||||||
| TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | |
| JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) | |
| UnderlyingJobState(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, []trainer.JobStatus, error) |
I think that TerminalCondition and JobsStatus responsibilities are almost same which is just propagate the Job state from underlying one to TrainJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense for me to consolidate those two plugins together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've folded the two plugins together. PTAL.
Uh oh!
There was an error while loading. Please reload this page.