-
Notifications
You must be signed in to change notification settings - Fork 834
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus #2802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 18313170282Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @astefanutti!
Please can you also add the integration: https://github.com/kubeflow/trainer/blob/master/test/integration/controller/trainjob_controller_test.go#L74 and e2e test: https://github.com/kubeflow/trainer/blob/master/test/e2e/e2e_test.go#L67, where we check that JobsStatus.
} | ||
if len(f.jobsStatusPlugins) != 0 { | ||
return f.jobsStatusPlugins[0].JobsStatus(ctx, trainJob) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y @astefanutti What is the use-case when we can have more than one terminalConditionPlugin or jobsStatusPlugin ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich that would be if two plugins implement the same plugin interface so they would conflict.
The check is currently performed for each call, maybe we could do that during the runtime framework initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I am trying to understand when users can define more than 1 implementation for such plugins ?
Terminal condition or job status should always be defined by one plugin, unless I am missing something ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually yes, but practically it is possible to have multiple plugins implementing the same interface like framework.TerminalConditionPlugin
.
We could imagine this happens if we had other backend than JobSet, e.g., LWS and both would be made active by "mistake".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both would be made active by "mistake".
If that is set by mistake, shall we validate that only one plugin can be activated for such interfaces during framework initialization ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, make those plugins as not arrays here:
trainer/pkg/runtime/framework/core/framework.go
Lines 48 to 49 in 389023c
terminalConditionPlugins []framework.TerminalConditionPlugin | |
jobsStatusPlugins []framework.JobsStatusPlugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is set by mistake, shall we validate that only one plugin can be activated for such interfaces during framework initialization ?
I agree, that's what I meant when I said "maybe we could do that during the runtime framework initialization.".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich I've updated the logic for those "singleton" plugins so it fails-fast on at init time.
@andreyvelich I've added/updated integration and e2e tests. PTAL when you can. |
} | ||
if len(f.jobsStatusPlugins) != 0 { | ||
return f.jobsStatusPlugins[0].JobsStatus(ctx, trainJob) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, make those plugins as not arrays here:
trainer/pkg/runtime/framework/core/framework.go
Lines 48 to 49 in 389023c
terminalConditionPlugins []framework.TerminalConditionPlugin | |
jobsStatusPlugins []framework.JobsStatusPlugin |
Name: status.Name, | ||
Ready: status.Ready, | ||
Succeeded: status.Succeeded, | ||
Failed: status.Failed, | ||
Active: status.Active, | ||
Suspended: status.Suspended, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering whether we will introduce a breaking change after this issue: kubernetes-sigs/jobset#723 ?
I proposed that we don't use Succeeded
as Job condition in rJob.
Any thoughts @astefanutti @tenzen-y @kannon92 @kubeflow/kubeflow-trainer-team ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this would be better to be consistent across projects, thought I'd suggest not to hold this PR as it'll be a breaking change anyway and have it addressed separately so we can clearly "release-note" it as a breaking change. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed offline in Slack, we will use Succeeded condition for now.
We will change this API to Complete in a future version, once JobSet migrates its APIs.
c25b925
to
3e4fc1a
Compare
/retest |
/milestone v2.1 |
errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered") | ||
errorTooManyJobsStatusPlugin = errors.New("too many JobsStatus plugins are registered") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we can only register one plugin for TerminalCondition and JobsStatus, do we need these errors ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually the error that's returned at init time in case more than one plugin is being registered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it won't be possible since we initialize the Framework here:
trainer/pkg/runtime/framework/core/framework.go
Lines 53 to 55 in 848a4bd
f := &Framework{ | |
registry: r, | |
} |
So this condition will never be executed:
trainer/pkg/runtime/framework/core/framework.go
Lines 82 to 85 in 848a4bd
if p, ok := plugin.(framework.TerminalConditionPlugin); ok { | |
if f.terminalConditionPlugin != nil { | |
return nil, errorTooManyTerminalConditionPlugin | |
} |
Am I missing something ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right and that's where the errors are returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefanutti Can you explain the flow when f.terminalConditionPlugin != nil
when we execute framework's New()
function ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I am wondering if there is any need to distinguish plugins that we define in the Registry and plugins that are part of those "registered" plugins (e.g. TerminalCondition plugin is part of JobSet plugin).
I could not catch the reason why you want to do that.
I think the current @astefanutti 's implementations are a straightforward way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I was asking what does "Plugin" mean in the Extension Framework from your point of view ?
And how are we going to allow users to extend the framework with their own plugins ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, got it. The trainer provides the 2 level abstraction layers which is Runtime Framework
and Plugin
.
But, sometimes, the plugin is only for a specific Runtime Framework
like the JobSet plugin.
I guess that we might want to have mechanisms to represent the tie between the Runtime Framework and the Plugin, but it would be better to avoid adding any validation relationship to allow them to reuse the implemented Plugin wherever the Runtime Framework.
Anyway, we can consider that as an another enhancement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered") | ||
errorTooManyJobsStatusPlugin = errors.New("too many JobsStatus plugins are registered") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I am wondering if there is any need to distinguish plugins that we define in the Registry and plugins that are part of those "registered" plugins (e.g. TerminalCondition plugin is part of JobSet plugin).
I could not catch the reason why you want to do that.
I think the current @astefanutti 's implementations are a straightforward way.
pkg/runtime/interface.go
Outdated
TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | ||
JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | |
JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) | |
UnderlyingJobState(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, []trainer.JobStatus, error) |
I think that TerminalCondition
and JobsStatus
responsibilities are almost same which is just propagate the Job state from underlying one to TrainJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense for me to consolidate those two plugins together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've folded the two plugins together. PTAL.
pkg/runtime/framework/interface.go
Outdated
type TerminalConditionPlugin interface { | ||
Plugin | ||
TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error) | ||
} | ||
|
||
type JobsStatusPlugin interface { | ||
Plugin | ||
JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto related to consolidation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with the two plugins together. PTAL.
// TODO (tenzen-y): Once we provide the Configuration API, we should validate which plugin should have terminalCondition execution points. | ||
if len(f.terminalConditionPlugins) > 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why we can remove the verification of the number of plugins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed here, users won't be able to register more than one plugin for terminal condition: #2802 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the check is now moved at initialization, rather than at (first) execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. The new way makes more sense.
Thank you!
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tenzen-y The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
This PR adds the JobsStatusPlugin runtime interface and implements it in the JobSet plugin so the TrainJob JobsStatus is synced from JobSet ReplicatedJobsStatus.
/cc @kubeflow/kubeflow-trainer-team @toVersus
Checklist: