
Conversation

@andreyvelich
Member

@andreyvelich andreyvelich commented Jul 28, 2025

I added the wait_for_job_status() API to the TrainerClient.

This API is similar to the v1 SDK.
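
For example, usage might look roughly like this (a sketch; the import path and the train() arguments are illustrative, and parameter names follow the docstring):

from kubeflow.trainer import TrainerClient

client = TrainerClient()
job_name = client.train(...)  # create a TrainJob; arguments omitted for brevity

# Block until the TrainJob reaches one of the expected statuses.
job = client.wait_for_job_status(job_name, status={"Complete"}, timeout=600)
print(job.status)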

Note on TrainJob Status

I updated the TrainJob Running status: when all training nodes (e.g. Pods) are in the Running phase, we transition the TrainJob to the Running status. I believe that is a good assumption for the majority of Trainer users.

Looking forward to your feedback!

/assign @kubeflow/kubeflow-trainer-team @astefanutti @szaher @kramaranya @eoinfennessy @briangallagher

TODO:

  • Update unit tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich changed the title [WIP] feat(trainer): Add wait_for_job_status() API [WIP] feat(trainer): Add wait_for_job_status() API Jul 28, 2025
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls

coveralls commented Jul 29, 2025

Pull Request Test Coverage Report for Build 16650322582

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 28 of 34 (82.35%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.9%) to 65.509%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
python/kubeflow/trainer/api/trainer_client.py | 26 | 32 | 81.25%

Files with Coverage Reduction | New Missed Lines | %
python/kubeflow/trainer/api/trainer_client.py | 2 | 74.87%

Totals
Change from base Build 16567361500: 0.9%
Covered Lines: 264
Relevant Lines: 403

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich changed the title [WIP] feat(trainer): Add wait_for_job_status() API feat(trainer): Add wait_for_job_status() API Jul 29, 2025
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Contributor

@kramaranya kramaranya left a comment

Thank you @andreyvelich!!

Comment on lines 463 to 467
job_statuses = {
    constants.TRAINJOB_RUNNING,
    constants.TRAINJOB_COMPLETE,
    constants.TRAINJOB_FAILED,
}
Contributor

TRAINJOB_CREATED should be here, right?

Member Author

Actually, there is no need to wait for the Created status, since we automatically apply it after the TrainJob is created in the Kubernetes cluster:

trainjob.status = constants.TRAINJOB_CREATED

Contributor

Yeah, but what if the user does client.wait_for_job_status("my-job", status={"Created"})?

Member Author

I also noticed that I had incorrectly assigned the status to the TrainJob; fixed it here: 675f101

I think you are right @kramaranya, it might make sense to allow the user to wait for the Created status.

Although we don't create the TrainJob asynchronously, I can imagine that users might want to run something like this:
https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L234-L240

job_id = train(...)
job = wait_for_job_status(job_id, status={"Created"})
print(job.creation_timestamp)


return logs_dict

def wait_for_job_status(
Contributor

Shall we also accept a namespace parameter?

Member Author

No, since the namespace is controlled by the TrainerClient: https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L38
We don't allow APIs (e.g. get_job()) to override it.

Ideally, we should abstract the namespace context away from the SDK user.
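
For illustration (a sketch; the constructor signature is assumed from the linked code), the namespace is fixed once when the client is created and every API call uses it:

from kubeflow.trainer import TrainerClient

# Assumption: the constructor accepts a namespace; get_job(),
# wait_for_job_status(), etc. then all operate in that namespace.
client = TrainerClient(namespace="team-a")
job = client.wait_for_job_status("my-job", status={"Running"})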

Contributor

Oh, I see, makes sense now!

Comment on lines 446 to 451
Args:
    name: Name of the TrainJob.
    status: Set of expected statuses. It must be a subset of Running, Complete, and Failed
        statuses.
    timeout: How many seconds to wait until TrainJob reaches one of the expected conditions.
    polling_interval: The polling interval in seconds to check TrainJob status.
Contributor

What do you think about adding a verbose boolean parameter, which would print status updates? I think we did something similar in v1 -- https://github.com/kubeflow/trainer/blob/release-1.9/sdk/python/kubeflow/training/api/training_client.py#L1053-L1058.

That might look like:
client.wait_for_job_status("my-job", verbose=True)

Member Author

We can; alternatively, we can use logger.debug() to print TrainJob status updates,
like in the other APIs: https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L250-L252
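
With the logger-based approach, a user could opt in to the status updates by enabling debug logging; roughly (the logger name is an assumption):

import logging

# Show the SDK's debug-level status updates.
# Assumption: the SDK logs under the "kubeflow.trainer" logger name.
logging.basicConfig(level=logging.INFO)
logging.getLogger("kubeflow.trainer").setLevel(logging.DEBUG)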

WDYT @astefanutti @kramaranya @Electronic-Waste @szaher @briangallagher ?

andreyvelich and others added 3 commits July 31, 2025 00:08
Co-authored-by: Anya Kramar <akramar@redhat.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

return logs_dict

def wait_for_job_status(
Contributor

Would it make sense to be a bit more generic and call it something like wait_for_job_condition, similar to what the kubectl wait command provides?

Member Author

I was thinking about it; however, I am not sure we want to expose the complexity of CR conditions to the SDK user.
For example, a condition has type, status, reason, etc. -- API surface that we don't really need to expose to the user (at least for now).

Thus, I suggest that we just expose a single TrainJob status to make it easier to read.

Contributor

Sounds good, that makes sense. We can always add a lower-level method later if needed.

raise ValueError(
    f"Expected status {status} must be a subset of {job_statuses}"
)
for _ in range(round(timeout / polling_interval)):
Contributor

Do you think a watch request could be used instead of polling?

Member Author

Sure, let me try that!
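
For reference, a minimal sketch of a watch-based wait with the Kubernetes Python client (the namespace and label selector are illustrative, not the SDK's actual implementation):

from kubernetes import client, config, watch

config.load_kube_config()
core_api = client.CoreV1Api()

w = watch.Watch()
# Stream Pod events instead of polling; stop once a terminal phase is observed.
for event in w.stream(
    core_api.list_namespaced_pod,
    namespace="default",
    label_selector="jobset.sigs.k8s.io/jobset-name=my-job",  # illustrative selector
    timeout_seconds=600,
):
    pod = event["object"]
    if pod.status.phase in ("Succeeded", "Failed"):
        w.stop()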

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jul 31, 2025
w = watch.Watch()
try:
    for event in w.stream(
        self.core_api.list_namespaced_pod,
Member Author

@astefanutti I had to watch for Pod events since we don't push all events to the TrainJob, but I think that is fine for now.

Contributor

@astefanutti astefanutti Jul 31, 2025

@andreyvelich I'm not sure I understand why it's needed to watch for Pods, since the logic then gets the TrainJob with trainjob = self.get_job(name)? Shouldn't the event have what's needed?

Member Author

@astefanutti The trick is that we don't expose events of running Pods to the TrainJob.
We only have the Complete and Failed conditions for now.
And the TrainJob Running condition is set when all training node Pods are running:

else:
    # The TrainJob running status is defined when all training nodes (e.g. Pods) are running.
    num_running_nodes = sum(
        1
        for step in trainjob.steps
        if step.name.startswith(constants.NODE)
        and step.status == constants.TRAINJOB_RUNNING
    )
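
Presumably the count is then compared against the expected number of training nodes before marking the job as Running, roughly like this (a sketch; num_nodes is an assumed variable, not necessarily the actual attribute name):

# Sketch (assumption): mark the TrainJob as Running only when every
# training node Pod is in the Running phase.
if num_running_nodes == num_nodes:
    trainjob.status = constants.TRAINJOB_RUNNING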

Contributor

@andreyvelich Ah I see. I missed this. That makes sense then.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

/assign @kramaranya @astefanutti Let me know if the recent changes look good!

@astefanutti
Contributor

/lgtm

Thanks!

@andreyvelich
Member Author

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 9bd64d8 into kubeflow:main Jul 31, 2025
8 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Jul 31, 2025
@andreyvelich andreyvelich deleted the wait-for-job-api branch July 31, 2025 16:54