Skip to content

Conversation

astefanutti
Copy link
Contributor

@astefanutti astefanutti commented Aug 28, 2025

What this PR does / why we need it:

This PR adds the JobsStatusPlugin runtime interface and implements it in the JobSet plugin so the TrainJob JobsStatus is synced from JobSet ReplicatedJobsStatus.

/cc @kubeflow/kubeflow-trainer-team @toVersus

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Aug 28, 2025

Pull Request Test Coverage Report for Build 18313170282

Details

  • 8 of 38 (21.05%) changed or added relevant lines in 5 files are covered.
  • 3 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.3%) to 54.234%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/runtime/core/clustertrainingruntime.go 0 2 0.0%
pkg/runtime/core/trainingruntime.go 0 2 0.0%
pkg/controller/trainjob_controller.go 0 7 0.0%
pkg/runtime/framework/plugins/jobset/jobset.go 0 19 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller/trainjob_controller.go 1 0.0%
pkg/runtime/core/clustertrainingruntime.go 1 44.44%
pkg/runtime/core/trainingruntime.go 1 69.83%
Totals Coverage Status
Change from base Build 18271389532: -0.3%
Covered Lines: 1204
Relevant Lines: 2220

💛 - Coveralls

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 161 to 163
}
if len(f.jobsStatusPlugins) != 0 {
return f.jobsStatusPlugins[0].JobsStatus(ctx, trainJob)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @astefanutti What is the use-case when we can have more than one terminalConditionPlugin or jobsStatusPlugin ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich that would be if two plugins implement the same plugin interface so they would conflict.

The check is currently performed for each call, maybe we could do that during the runtime framework initialization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I am trying to understand when users can define more than 1 implementation for such plugins ?
Terminal condition or job status should always be defined by one plugin, unless I am missing something ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually yes, but practically it is possible to have multiple plugins implementing the same interface like framework.TerminalConditionPlugin.

We could imagine this happens if we had other backend than JobSet, e.g., LWS and both would be made active by "mistake".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both would be made active by "mistake".

If that is set by mistake, shall we validate that only one plugin can be activated for such interfaces during framework initialization ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, make those plugins as not arrays here:

terminalConditionPlugins []framework.TerminalConditionPlugin
jobsStatusPlugins []framework.JobsStatusPlugin
?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is set by mistake, shall we validate that only one plugin can be activated for such interfaces during framework initialization ?

I agree, that's what I meant when I said "maybe we could do that during the runtime framework initialization.".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I've updated the logic for those "singleton" plugins so it fails-fast on at init time.

@astefanutti
Copy link
Contributor Author

Thank you for this @astefanutti! Please can you also add the integration: https://github.com/kubeflow/trainer/blob/master/test/integration/controller/trainjob_controller_test.go#L74 and e2e test: https://github.com/kubeflow/trainer/blob/master/test/e2e/e2e_test.go#L67, where we check that JobsStatus.

@andreyvelich I've added/updated integration and e2e tests. PTAL when you can.

Comment on lines 161 to 163
}
if len(f.jobsStatusPlugins) != 0 {
return f.jobsStatusPlugins[0].JobsStatus(ctx, trainJob)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, make those plugins as not arrays here:

terminalConditionPlugins []framework.TerminalConditionPlugin
jobsStatusPlugins []framework.JobsStatusPlugin
?

Comment on lines +316 to +322
Name: status.Name,
Ready: status.Ready,
Succeeded: status.Succeeded,
Failed: status.Failed,
Active: status.Active,
Suspended: status.Suspended,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering whether we will introduce a breaking change after this issue: kubernetes-sigs/jobset#723 ?
I proposed that we don't use Succeeded as Job condition in rJob.
Any thoughts @astefanutti @tenzen-y @kannon92 @kubeflow/kubeflow-trainer-team ?

Copy link
Contributor Author

@astefanutti astefanutti Sep 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this would be better to be consistent across projects, thought I'd suggest not to hold this PR as it'll be a breaking change anyway and have it addressed separately so we can clearly "release-note" it as a breaking change. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed offline in Slack, we will use Succeeded condition for now.
We will change this API to Complete in a future version, once JobSet migrates its APIs.

@astefanutti astefanutti force-pushed the pr-25 branch 2 times, most recently from c25b925 to 3e4fc1a Compare September 2, 2025 07:18
@astefanutti
Copy link
Contributor Author

/retest

@andreyvelich
Copy link
Member

/milestone v2.1

@google-oss-prow google-oss-prow bot added this to the v2.1 milestone Sep 26, 2025
Comment on lines 35 to 37
errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered")
errorTooManyJobsStatusPlugin = errors.New("too many JobsStatus plugins are registered")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we can only register one plugin for TerminalCondition and JobsStatus, do we need these errors ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually the error that's returned at init time in case more than one plugin is being registered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it won't be possible since we initialize the Framework here:

f := &Framework{
registry: r,
}

So this condition will never be executed:

if p, ok := plugin.(framework.TerminalConditionPlugin); ok {
if f.terminalConditionPlugin != nil {
return nil, errorTooManyTerminalConditionPlugin
}

Am I missing something ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right and that's where the errors are returned.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti Can you explain the flow when f.terminalConditionPlugin != nil when we execute framework's New() function ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I am wondering if there is any need to distinguish plugins that we define in the Registry and plugins that are part of those "registered" plugins (e.g. TerminalCondition plugin is part of JobSet plugin).

I could not catch the reason why you want to do that.
I think the current @astefanutti 's implementations are a straightforward way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I was asking what does "Plugin" mean in the Extension Framework from your point of view ?
And how are we going to allow users to extend the framework with their own plugins ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it. The trainer provides the 2 level abstraction layers which is Runtime Framework and Plugin.
But, sometimes, the plugin is only for a specific Runtime Framework like the JobSet plugin.

I guess that we might want to have mechanisms to represent the tie between the Runtime Framework and the Plugin, but it would be better to avoid adding any validation relationship to allow them to reuse the implemented Plugin wherever the Runtime Framework.

Anyway, we can consider that as an another enhancement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Comment on lines 35 to 37
errorTooManyTerminalConditionPlugin = errors.New("too many TerminalCondition plugins are registered")
errorTooManyJobsStatusPlugin = errors.New("too many JobsStatus plugins are registered")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I am wondering if there is any need to distinguish plugins that we define in the Registry and plugins that are part of those "registered" plugins (e.g. TerminalCondition plugin is part of JobSet plugin).

I could not catch the reason why you want to do that.
I think the current @astefanutti 's implementations are a straightforward way.

Comment on lines 41 to 42
TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error)
JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error)
JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error)
UnderlyingJobState(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, []trainer.JobStatus, error)

I think that TerminalCondition and JobsStatus responsibilities are almost same which is just propagate the Job state from underlying one to TrainJob.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense for me to consolidate those two plugins together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've folded the two plugins together. PTAL.

Comment on lines 64 to 72
type TerminalConditionPlugin interface {
Plugin
TerminalCondition(ctx context.Context, trainJob *trainer.TrainJob) (*metav1.Condition, error)
}

type JobsStatusPlugin interface {
Plugin
JobsStatus(ctx context.Context, trainJob *trainer.TrainJob) ([]trainer.JobStatus, error)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto related to consolidation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with the two plugins together. PTAL.

Comment on lines -145 to -146
// TODO (tenzen-y): Once we provide the Configuration API, we should validate which plugin should have terminalCondition execution points.
if len(f.terminalConditionPlugins) > 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we can remove the verification of the number of plugins?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed here, users won't be able to register more than one plugin for terminal condition: #2802 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the check is now moved at initialization, rather than at (first) execution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. The new way makes more sense.
Thank you!

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thank you!
/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit c1bc94d into kubeflow:master Oct 7, 2025
26 of 27 checks passed
@astefanutti astefanutti deleted the pr-25 branch October 7, 2025 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants