feat(trainer): Add wait_for_job_status() API
#52
Conversation
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
wait_for_job_status() API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Pull Request Test Coverage Report for Build 16650322582. Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
wait_for_job_status() API
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Thank you @andreyvelich!!
job_statuses = {
    constants.TRAINJOB_RUNNING,
    constants.TRAINJOB_COMPLETE,
    constants.TRAINJOB_FAILED,
}
TRAINJOB_CREATED should be here, right?
Actually, there is no need to wait for the Created status, since we automatically apply it after the TrainJob is created in the Kubernetes cluster:
trainjob.status = constants.TRAINJOB_CREATED
Yeah, but what if the user does client.wait_for_job_status("my-job", status={"Created"})?
I also noticed that I incorrectly assigned the status to the TrainJob; fixed it here: 675f101
I think you are right @kramaranya, it might make sense to allow the user to wait for the Created status.
Although we don't create the TrainJob asynchronously, I can imagine that users want to run something like this:
https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L234-L240
job_id = train(...)
job = wait_for_job_status(job_id, status={"Created"})
print(job.creation_timestamp)
return logs_dict

def wait_for_job_status(
Shall we also accept a namespace parameter?
No, since the namespace is controlled by TrainerClient: https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L38
We don't allow APIs (e.g. get_job()) to override it.
Ideally, we should abstract the namespace context from the SDK user.
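For illustration, this is roughly how the namespace scoping looks from the user's side (a minimal sketch; the import path and constructor argument name are assumptions based on the linked code, not verified):

from kubeflow.trainer import TrainerClient  # import path assumed

# The namespace is fixed once at client construction (assumed argument name),
# so per-API calls such as get_job() or wait_for_job_status() never take one.
client = TrainerClient(namespace="team-a")
job = client.get_job("my-job")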
Oh, I see, makes sense now!
Args:
    name: Name of the TrainJob.
    status: Set of expected statuses. It must be subset of Running, Complete, and Failed
        statuses.
    timeout: How many seconds to wait until TrainJob reaches one of the expected conditions.
    polling_interval: The polling interval in seconds to check TrainJob status.
What do you think about adding a verbose boolean parameter, which would print status updates? I think we also did something similar in v1: https://github.com/kubeflow/trainer/blob/release-1.9/sdk/python/kubeflow/training/api/training_client.py#L1053-L1058.
That might look like:
client.wait_for_job_status("my-job", verbose=True)
We can; alternatively, we could use logger.debug() to print TrainJob status updates.
Like in other APIs: https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L250-L252
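As a rough sketch of that alternative (the logger name is an assumption, not the actual implementation), a user who wants status updates would raise the log level instead of passing verbose=True:

import logging

# Assumed logger name for the trainer client; enabling DEBUG would surface
# any logger.debug() status updates emitted while waiting.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("kubeflow.trainer").setLevel(logging.DEBUG)

job = client.wait_for_job_status("my-job")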
WDYT @astefanutti @kramaranya @Electronic-Waste @szaher @briangallagher ?
Co-authored-by: Anya Kramar <akramar@redhat.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
return logs_dict

def wait_for_job_status(
Would it make sense to be a bit more generic and make it something like wait_for_job_condition, similar to what the kubectl wait command provides?
I was thinking about it; however, I am not sure we want to expose the complexity of CR conditions to the SDK user.
For example, a condition has type, status, reason, etc. fields that we don't really need to expose to the user (at least for now).
Thus, I suggest that we just expose a single TrainJob status to make it easier to read.
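To illustrate the difference in surface area, the status-based call stays a one-liner (values are illustrative; see the docstring quoted above for the actual parameters):

# Proposed in this PR: the user only names the expected statuses.
job = client.wait_for_job_status("my-job", status={"Running", "Complete"}, timeout=600)

# A kubectl-wait-style API would instead expose condition type/status/reason,
# e.g. a hypothetical wait_for_job_condition(name, condition_type=..., condition_status=...),
# which is the extra complexity being deferred here.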
Sounds good, that makes sense. We can always add a lower-level method later if needed.
raise ValueError(
    f"Expected status {status} must be a subset of {job_statuses}"
)
for _ in range(round(timeout / polling_interval)):
Do you think a watch request could be used instead of polling?
Sure, let me try that!
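For context, a watch replaces the fixed polling loop with a server-side event stream. A generic sketch with the kubernetes Python client (the API group/version and condition type names are assumptions, and this is not the PR's final code, which ends up watching Pods and re-reading the TrainJob):

from kubernetes import client, config, watch

config.load_kube_config()
custom_api = client.CustomObjectsApi()

w = watch.Watch()
# Stream TrainJob events instead of sleeping between list calls.
for event in w.stream(
    custom_api.list_namespaced_custom_object,
    group="trainer.kubeflow.org",   # assumed API group
    version="v1alpha1",             # assumed API version
    namespace="default",
    plural="trainjobs",
    timeout_seconds=600,
):
    trainjob = event["object"]
    conditions = trainjob.get("status", {}).get("conditions", [])
    # Stop streaming once the TrainJob reports a condition we are waiting for
    # (condition type names assumed for illustration).
    if any(c.get("type") in {"Complete", "Failed"} for c in conditions):
        w.stop()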
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
w = watch.Watch()
try:
    for event in w.stream(
        self.core_api.list_namespaced_pod,
@astefanutti I had to watch for Pod events since we don't push all events to TrainJob, but I think that would be fine for now.
@andreyvelich I'm not sure I understand why it's needed to watch for pods since the logic then gets the TrainJob with trainjob = self.get_job(name)? The event should have what's needed?
@astefanutti The trick is that we don't expose events of running pods to the TrainJob.
We only have Completed and Failed conditions for now.
And the TrainJob Running condition is set when all training node Pods are running:
sdk/python/kubeflow/trainer/api/trainer_client.py
Lines 661 to 668 in 81e43fd
else:
    # The TrainJob running status is defined when all training node (e.g. Pods) are running.
    num_running_nodes = sum(
        1
        for step in trainjob.steps
        if step.name.startswith(constants.NODE)
        and step.status == constants.TRAINJOB_RUNNING
    )
@andreyvelich Ah I see. I missed this. That makes sense then.
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
/assign @kramaranya @astefanutti Let me know if the recent changes look good!
/lgtm Thanks!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I added the wait_for_job_status() API to the TrainerClient. This API is similar to the v1 SDK.
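For reference, a typical call based on the signature discussed in the review (the import path is assumed, and train() is used as in the earlier snippet; argument values are illustrative):

from kubeflow.trainer import TrainerClient  # import path assumed

client = TrainerClient()
job_id = client.train(...)  # submit a TrainJob; arguments elided

# Block until the TrainJob reaches one of the expected statuses (subject to timeout).
job = client.wait_for_job_status(job_id, status={"Complete", "Failed"}, timeout=1800)
print(job.status)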
Note on TrainJob Status
I updated the TrainJob Running status. I consider that when all training nodes (e.g. Pods) are in the Running phase, we can transition the TrainJob to the Running status. I believe that is a good assumption for the majority of Trainer users.
Looking forward to your feedback!
/assign @kubeflow/kubeflow-trainer-team @astefanutti @szaher @kramaranya @eoinfennessy @briangallagher
TODO: