Skip to content

Conversation

sfeng1996
Copy link

What this PR does / why we need it:
Proposed additions:
Add Running status condition when the underlying JobSet/Jobs are actively executing
Add Pending status condition when the TrainJob is created but not yet running (e.g., waiting for resources, scheduling, etc.)

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes ##2713

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: sfeng1996 <sfeng1996@163.com>
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines +53 to +58
// TrainJobPending means that TrainJob is Pending.
TrainJobPending string = "Pending"

// TrainJobRunning means that TrainJob is Running.
TrainJobRunning string = "Running"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether the new condition(s) should stay consistent with the existing ones, i.e. modeling phases (vertices in the state machine) and not transitions (resp. edges).

In that case, it could possibly be covered by a new TrainJobStarted condition, whose reason would give details why the job is started or not, e.g. "components not created", "components initializing", "jobset startup policy in progress", "suspended", ...

@andreyvelich @tenzen-y @Electronic-Waste WDYT? Would a KEP be the recommend approach to capture the discussion on the new conditions or it's OK "inlined" in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should have conversations on what conditions we want to introduce for TrainJob.
We discussed before with @tenzen-y and @astefanutti that it is not easy to determine what is Running condition.
Open a new KEP is a good idea.

@sfeng1996 Do you want to drive it ?

/hold

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply. We've recently been using the Trainer for large-model fine-tuning, but the current TrainJob status is too simplistic and doesn't capture enough detailed information. I'm very interested in contributing to this feature and can open a new KEP to work on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think a KEP would help us have a clear understanding of trainjob status and its trainsition.

FYI, we've defined a status transition graph in https://github.com/kubeflow/trainer/tree/master/docs/proposals/2170-kubeflow-trainer-v2#state-transition

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply. I have carefully reviewed your KEP and can we discuss how to add other states?such as under what conditions it is running? We need the running state to determine if the trainjob is running normally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. I’m also running into issues because the TrainJob status is too simple. It would be really helpful if we could determine Pending / Running from the TrainJob status. Adding a new condition like in this PR would work, or even just propagating the JobSet status to the currently unused JobStatus would be fine:

type JobStatus struct {
// Name of the child Job.
Name string `json:"name"`
// Ready is the number of child Jobs where the number of ready pods and completed pods
// is greater than or equal to the total expected pod count for the child Job.
Ready int32 `json:"ready"`
// Succeeded is the number of successfully completed child Jobs.
Succeeded int32 `json:"succeeded"`
// Failed is the number of failed child Jobs.
Failed int32 `json:"failed"`
// Active is the number of child Jobs with at least 1 pod in a running or pending state
// which are not marked for deletion.
Active int32 `json:"active"`
// Suspended is the number of child Jobs which are in a suspended state.
Suspended int32 `json:"suspended"`
}

Is this still at the stage of creating a KEP?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toVersus I've open #2802 to address the TrainJob JobsStatus that's also a concern for us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants