-
Notifications
You must be signed in to change notification settings - Fork 834
feat: support add pending, runing condition to trainjob #2725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: sfeng1996 <sfeng1996@163.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
// TrainJobPending means that TrainJob is Pending. | ||
TrainJobPending string = "Pending" | ||
|
||
// TrainJobRunning means that TrainJob is Running. | ||
TrainJobRunning string = "Running" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder whether the new condition(s) should stay consistent with the existing ones, i.e. modeling phases (vertices in the state machine) and not transitions (resp. edges).
In that case, it could possibly be covered by a new TrainJobStarted
condition, whose reason would give details why the job is started or not, e.g. "components not created", "components initializing", "jobset startup policy in progress", "suspended", ...
@andreyvelich @tenzen-y @Electronic-Waste WDYT? Would a KEP be the recommend approach to capture the discussion on the new conditions or it's OK "inlined" in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we should have conversations on what conditions we want to introduce for TrainJob.
We discussed before with @tenzen-y and @astefanutti that it is not easy to determine what is Running condition.
Open a new KEP is a good idea.
@sfeng1996 Do you want to drive it ?
/hold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reply. We've recently been using the Trainer for large-model fine-tuning, but the current TrainJob status is too simplistic and doesn't capture enough detailed information. I'm very interested in contributing to this feature and can open a new KEP to work on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think a KEP would help us have a clear understanding of trainjob status and its trainsition.
FYI, we've defined a status transition graph in https://github.com/kubeflow/trainer/tree/master/docs/proposals/2170-kubeflow-trainer-v2#state-transition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply. I have carefully reviewed your KEP and can we discuss how to add other states?such as under what conditions it is running? We need the running state to determine if the trainjob is running normally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi. I’m also running into issues because the TrainJob status is too simple. It would be really helpful if we could determine Pending / Running from the TrainJob status. Adding a new condition like in this PR would work, or even just propagating the JobSet status to the currently unused JobStatus would be fine:
trainer/pkg/apis/trainer/v1alpha1/trainjob_types.go
Lines 314 to 334 in 4443f79
type JobStatus struct { | |
// Name of the child Job. | |
Name string `json:"name"` | |
// Ready is the number of child Jobs where the number of ready pods and completed pods | |
// is greater than or equal to the total expected pod count for the child Job. | |
Ready int32 `json:"ready"` | |
// Succeeded is the number of successfully completed child Jobs. | |
Succeeded int32 `json:"succeeded"` | |
// Failed is the number of failed child Jobs. | |
Failed int32 `json:"failed"` | |
// Active is the number of child Jobs with at least 1 pod in a running or pending state | |
// which are not marked for deletion. | |
Active int32 `json:"active"` | |
// Suspended is the number of child Jobs which are in a suspended state. | |
Suspended int32 `json:"suspended"` | |
} |
Is this still at the stage of creating a KEP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this PR does / why we need it:
Proposed additions:
Add Running status condition when the underlying JobSet/Jobs are actively executing
Add Pending status condition when the TrainJob is created but not yet running (e.g., waiting for resources, scheduling, etc.)
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes ##2713
Checklist: