feat: support adding Pending and Running conditions to TrainJob #2725

sfeng1996 · 2025-07-09T06:50:24Z

What this PR does / why we need it:
Proposed additions:
Add Running status condition when the underlying JobSet/Jobs are actively executing
Add Pending status condition when the TrainJob is created but not yet running (e.g., waiting for resources, scheduling, etc.)
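
A minimal sketch of how the two proposed conditions could be kept mutually exclusive. The `Condition` struct is a simplified stand-in for `metav1.Condition`, and `readyJobs`, `reconcileProgress`, and the reason strings are hypothetical names for illustration; the rule "at least one ready child Job means Running" is an assumption, not something this PR defines.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition; only the fields
// needed by this sketch are modeled (assumption, not the Trainer API).
type Condition struct {
	Type   string
	Status bool
	Reason string
}

const (
	TrainJobPending = "Pending"
	TrainJobRunning = "Running"
)

// setCondition replaces the condition with the same Type, or appends it.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// reconcileProgress sets Running=true when at least one child Job is ready,
// and Pending=true otherwise. readyJobs and the reasons are hypothetical.
func reconcileProgress(conds []Condition, readyJobs int32) []Condition {
	running := readyJobs > 0
	reason := "WaitingForJobs"
	if running {
		reason = "JobsReady"
	}
	conds = setCondition(conds, Condition{Type: TrainJobRunning, Status: running, Reason: reason})
	conds = setCondition(conds, Condition{Type: TrainJobPending, Status: !running, Reason: reason})
	return conds
}

func main() {
	// Just created: nothing is ready yet.
	conds := reconcileProgress(nil, 0)
	// Later: two child Jobs became ready.
	conds = reconcileProgress(conds, 2)
	for _, c := range conds {
		fmt.Printf("%s=%t (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```

Updating both conditions in the same pass keeps them from ever being true at the same time, which is one way to read the intent of the proposal.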

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2713

Checklist:

Docs included if any changes are user facing

Signed-off-by: sfeng1996 <sfeng1996@163.com>

google-oss-prow · 2025-07-09T06:50:31Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

astefanutti · 2025-07-09T11:19:53Z

pkg/apis/trainer/v1alpha1/trainjob_types.go

+	// TrainJobPending means that TrainJob is Pending.
+	TrainJobPending string = "Pending"
+
+	// TrainJobRunning means that TrainJob is Running.
+	TrainJobRunning string = "Running"
+


I wonder whether the new condition(s) should stay consistent with the existing ones, i.e. modeling phases (vertices in the state machine) and not transitions (resp. edges).

In that case, it could possibly be covered by a new TrainJobStarted condition, whose reason would give details why the job is started or not, e.g. "components not created", "components initializing", "jobset startup policy in progress", "suspended", ...

@andreyvelich @tenzen-y @Electronic-Waste WDYT? Would a KEP be the recommended approach to capture the discussion on the new conditions, or is it OK to keep it inline in this PR?
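
To make the single-condition alternative concrete, here is a rough sketch of a `TrainJobStarted` condition whose reason explains why the job has or has not started. The `startInputs` type, the `startedCondition` function, and the reason strings are all hypothetical; they only mirror the examples listed in the comment and are not part of the Trainer API.

```go
package main

import "fmt"

// TrainJobStarted is a hypothetical condition type taken from the suggestion above.
const TrainJobStarted = "Started"

// startInputs is a hypothetical summary of what a controller has observed.
type startInputs struct {
	Suspended         bool
	ComponentsCreated bool
	ComponentsReady   bool
}

// startedCondition returns the status of the Started condition and a reason
// explaining it. The reason strings are illustrative only.
func startedCondition(in startInputs) (bool, string) {
	switch {
	case in.Suspended:
		return false, "Suspended"
	case !in.ComponentsCreated:
		return false, "ComponentsNotCreated"
	case !in.ComponentsReady:
		return false, "ComponentsInitializing"
	default:
		return true, "ComponentsReady"
	}
}

func main() {
	observations := []startInputs{
		{Suspended: true},
		{},
		{ComponentsCreated: true},
		{ComponentsCreated: true, ComponentsReady: true},
	}
	for _, in := range observations {
		started, reason := startedCondition(in)
		fmt.Printf("%s=%t reason=%s\n", TrainJobStarted, started, reason)
	}
}
```

With this shape, "Pending" is expressed as `Started=false` plus a reason, so clients read one condition instead of reconciling two.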

andreyvelich · 2025-07-09

Yes, I think we should have a conversation about which conditions we want to introduce for TrainJob.
We discussed before with @tenzen-y and @astefanutti that it is not easy to determine what the Running condition should mean.
Opening a new KEP is a good idea.

@sfeng1996 Do you want to drive it?

/hold

sfeng1996 · 2025-07-10

Thanks for the reply. We've recently been using Trainer for large-model fine-tuning, but the current TrainJob status is too coarse and doesn't capture enough detail. I'm very interested in contributing to this feature and can open a new KEP to work on it.

Electronic-Waste · 2025-07-10

I also think a KEP would help us get a clear understanding of the TrainJob status and its transitions.

FYI, we've defined a status transition graph in https://github.com/kubeflow/trainer/tree/master/docs/proposals/2170-kubeflow-trainer-v2#state-transition

sfeng1996 · 2025-07-25

Sorry for the late reply. I have carefully reviewed your KEP. Can we discuss how to add other states, for example under what conditions a TrainJob counts as running? We need a Running state to determine whether the TrainJob is running normally.

toVersus · 2025-08-28

Hi. I’m also running into issues because the TrainJob status is too simple. It would be really helpful if we could determine Pending / Running from the TrainJob status. Adding a new condition like in this PR would work, or even just propagating the JobSet status to the currently unused JobStatus would be fine:

trainer/pkg/apis/trainer/v1alpha1/trainjob_types.go

Lines 314 to 334 in 4443f79

	type JobStatus struct {
		// Name of the child Job.
		Name string `json:"name"`

		// Ready is the number of child Jobs where the number of ready pods and completed pods
		// is greater than or equal to the total expected pod count for the child Job.
		Ready int32 `json:"ready"`

		// Succeeded is the number of successfully completed child Jobs.
		Succeeded int32 `json:"succeeded"`

		// Failed is the number of failed child Jobs.
		Failed int32 `json:"failed"`

		// Active is the number of child Jobs with at least 1 pod in a running or pending state
		// which are not marked for deletion.
		Active int32 `json:"active"`

		// Suspended is the number of child Jobs which are in a suspended state.
		Suspended int32 `json:"suspended"`
	}

Is this still at the stage of creating a KEP?
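
The idea of reading Pending / Running off the propagated `JobStatus` entries could be sketched roughly as follows. The struct copies the fields quoted above; `coarsePhase` and its rule (Running only when every entry reports at least one ready child Job) are assumptions made for illustration, since the thread leaves the exact definition of Running open.

```go
package main

import "fmt"

// JobStatus mirrors the fields quoted from trainjob_types.go above.
type JobStatus struct {
	Name      string `json:"name"`
	Ready     int32  `json:"ready"`
	Succeeded int32  `json:"succeeded"`
	Failed    int32  `json:"failed"`
	Active    int32  `json:"active"`
	Suspended int32  `json:"suspended"`
}

// coarsePhase is a hypothetical helper: it reports "Running" only when there
// is at least one entry and every entry has Ready > 0, and "Pending" otherwise.
func coarsePhase(jobs []JobStatus) string {
	if len(jobs) == 0 {
		return "Pending"
	}
	for _, j := range jobs {
		if j.Ready == 0 {
			return "Pending"
		}
	}
	return "Running"
}

func main() {
	// Hypothetical entry names, used only to show the two outcomes.
	starting := []JobStatus{{Name: "initializer", Ready: 1}, {Name: "node", Active: 1}}
	steady := []JobStatus{{Name: "initializer", Ready: 1}, {Name: "node", Ready: 2}}
	fmt.Println(coarsePhase(starting))
	fmt.Println(coarsePhase(steady))
}
```

Note that `Active` alone is not used to infer Running here, because per the field comment it also counts child Jobs whose pods are still pending.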

astefanutti · 2025-08-28

@toVersus I've opened #2802 to address the TrainJob JobsStatus, which is also a concern for us.

feat: support add pending, runing condition to trainjob

8c9bd58

Signed-off-by: sfeng1996 <sfeng1996@163.com>

google-oss-prow bot requested review from astefanutti and kuizhiqing July 9, 2025 06:50

google-oss-prow bot added the size/M label Jul 9, 2025

astefanutti reviewed Jul 9, 2025

google-oss-prow bot added the do-not-merge/hold label Jul 9, 2025


sfeng1996 commented Jul 9, 2025
google-oss-prow bot commented Jul 9, 2025
astefanutti Jul 9, 2025
andreyvelich Jul 9, 2025
sfeng1996 Jul 10, 2025
Electronic-Waste Jul 10, 2025
sfeng1996 Jul 25, 2025
toVersus Aug 28, 2025
astefanutti Aug 28, 2025


5 participants

