
Conversation

@andreyvelich
Member

@andreyvelich andreyvelich commented Jul 28, 2025

I added the wait_for_job_status() API to the TrainerClient.

This API is similar to the v1 SDK.
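
For example, usage might look roughly like this (a sketch; the import path and the train() arguments are illustrative, and parameter names follow the docstring):

from kubeflow.trainer import TrainerClient

client = TrainerClient()
job_name = client.train(...)  # create a TrainJob; arguments omitted for brevity

# Block until the TrainJob reaches one of the expected statuses.
job = client.wait_for_job_status(job_name, status={"Complete"}, timeout=600)
print(job.status)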

Note on TrainJob Status

I updated the TrainJob Running status: when all training nodes (e.g. Pods) are in the Running phase, we transition the TrainJob to the Running status. I believe that is a good assumption for the majority of Trainer users.

Looking forward to your feedback!

/assign @kubeflow/kubeflow-trainer-team @astefanutti @szaher @kramaranya @eoinfennessy @briangallagher

TODO:

  • Update unit tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich changed the title [WIP] feat(trainer): Add wait_for_job_status() API [WIP] feat(trainer): Add wait_for_job_status() API Jul 28, 2025
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls

coveralls commented Jul 29, 2025

Pull Request Test Coverage Report for Build 16650322582

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 28 of 34 (82.35%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.9%) to 65.509%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
python/kubeflow/trainer/api/trainer_client.py | 26 | 32 | 81.25%

Files with Coverage Reduction | New Missed Lines | %
python/kubeflow/trainer/api/trainer_client.py | 2 | 74.87%

Totals
Change from base Build 16567361500: 0.9%
Covered Lines: 264
Relevant Lines: 403

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich changed the title [WIP] feat(trainer): Add wait_for_job_status() API feat(trainer): Add wait_for_job_status() API Jul 29, 2025
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Contributor

@kramaranya kramaranya left a comment

Thank you @andreyvelich!!

Comment on lines 463 to 467
job_statuses = {
    constants.TRAINJOB_RUNNING,
    constants.TRAINJOB_COMPLETE,
    constants.TRAINJOB_FAILED,
}
Contributor

TRAINJOB_CREATED should be here, right?

Member Author

Actually, there is no need to wait for the Created status, since we automatically apply it after the TrainJob is created in the Kubernetes cluster:

trainjob.status = constants.TRAINJOB_CREATED

Contributor

Yeah, but what if the user does client.wait_for_job_status("my-job", status={"Created"})?

Member Author

I also noticed that I had incorrectly assigned the status to the TrainJob; fixed it here: 675f101

I think you are right @kramaranya, it might make sense to allow the user to wait for the Created status.

Although we don't create the TrainJob asynchronously, I can imagine that users might want to run something like this:
https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L234-L240

job_id = train(...)
job = wait_for_job_status(job_id, status={"Created"})
print(job.creation_timestamp)


return logs_dict

def wait_for_job_status(
Contributor

Shall we also accept a namespace parameter?

Member Author

No, since the namespace is controlled by the TrainerClient: https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L38
We don't allow APIs (e.g. get_job()) to override it.

Ideally, we should abstract the namespace context away from the SDK user.
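
For illustration (a sketch; the constructor signature is assumed from the linked code), the namespace is fixed once when the client is created and every API call uses it:

from kubeflow.trainer import TrainerClient

# Assumption: the constructor accepts a namespace; get_job(),
# wait_for_job_status(), etc. then all operate in that namespace.
client = TrainerClient(namespace="team-a")
job = client.wait_for_job_status("my-job", status={"Running"})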

Contributor

Oh, I see, makes sense now!

Comment on lines 446 to 451
Args:
    name: Name of the TrainJob.
    status: Set of expected statuses. It must be a subset of Running, Complete, and Failed
        statuses.
    timeout: How many seconds to wait until TrainJob reaches one of the expected conditions.
    polling_interval: The polling interval in seconds to check TrainJob status.
Contributor

What do you think about adding a verbose boolean parameter, which would print status updates? I think we did something similar in v1 -- https://github.com/kubeflow/trainer/blob/release-1.9/sdk/python/kubeflow/training/api/training_client.py#L1053-L1058.

That might look like:
client.wait_for_job_status("my-job", verbose=True)

Member Author

We can; alternatively, we can use logger.debug() to print TrainJob status updates,
like in the other APIs: https://github.com/kubeflow/sdk/blob/main/python/kubeflow/trainer/api/trainer_client.py#L250-L252
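
With the logger-based approach, a user could opt in to the status updates by enabling debug logging; roughly (the logger name is an assumption):

import logging

# Show the SDK's debug-level status updates.
# Assumption: the SDK logs under the "kubeflow.trainer" logger name.
logging.basicConfig(level=logging.INFO)
logging.getLogger("kubeflow.trainer").setLevel(logging.DEBUG)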

WDYT @astefanutti @kramaranya @Electronic-Waste @szaher @briangallagher ?

andreyvelich and others added 3 commits July 31, 2025 00:08
Co-authored-by: Anya Kramar <akramar@redhat.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

return logs_dict

def wait_for_job_status(
Contributor

Would it make sense to be a bit more generic and call it something like wait_for_job_condition, similar to what the kubectl wait command provides?

Member Author

I was thinking about it; however, I am not sure we want to expose the complexity of CR conditions to the SDK user.
For example, a condition has type, status, reason, etc. -- API surface that we don't really need to expose to the user (at least for now).

Thus, I suggest that we just expose a single TrainJob status to make it easier to read.

Contributor

Sounds good, that makes sense. We can always add a lower-level method later if needed.

raise ValueError(
    f"Expected status {status} must be a subset of {job_statuses}"
)
for _ in range(round(timeout / polling_interval)):
Contributor

Do you think a watch request could be used instead of polling?

Member Author

Sure, let me try that!
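
For reference, a minimal sketch of a watch-based wait with the Kubernetes Python client (the namespace and label selector are illustrative, not the SDK's actual implementation):

from kubernetes import client, config, watch

config.load_kube_config()
core_api = client.CoreV1Api()

w = watch.Watch()
# Stream Pod events instead of polling; stop once a terminal phase is observed.
for event in w.stream(
    core_api.list_namespaced_pod,
    namespace="default",
    label_selector="jobset.sigs.k8s.io/jobset-name=my-job",  # illustrative selector
    timeout_seconds=600,
):
    pod = event["object"]
    if pod.status.phase in ("Succeeded", "Failed"):
        w.stop()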

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jul 31, 2025
w = watch.Watch()
try:
    for event in w.stream(
        self.core_api.list_namespaced_pod,
Member Author

@astefanutti I had to watch for Pod events since we don't push all events to the TrainJob, but I think that is fine for now.

Contributor

@astefanutti astefanutti Jul 31, 2025

@andreyvelich I'm not sure I understand why it's needed to watch for Pods, since the logic then gets the TrainJob with trainjob = self.get_job(name)? Shouldn't the event have what's needed?

Member Author

@astefanutti The trick is that we don't expose events of running Pods to the TrainJob.
We only have the Complete and Failed conditions for now.
And the TrainJob Running condition is set when all training node Pods are running:

else:
    # The TrainJob running status is defined when all training nodes (e.g. Pods) are running.
    num_running_nodes = sum(
        1
        for step in trainjob.steps
        if step.name.startswith(constants.NODE)
        and step.status == constants.TRAINJOB_RUNNING
    )
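
Presumably the count is then compared against the expected number of training nodes before marking the job as Running, roughly like this (a sketch; num_nodes is an assumed variable, not necessarily the actual attribute name):

# Sketch (assumption): mark the TrainJob as Running only when every
# training node Pod is in the Running phase.
if num_running_nodes == num_nodes:
    trainjob.status = constants.TRAINJOB_RUNNING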

Contributor

@andreyvelich Ah I see. I missed this. That makes sense then.

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

/assign @kramaranya @astefanutti Let me know if the recent changes look good!

@astefanutti
Contributor

/lgtm

Thanks!

@andreyvelich
Member Author

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 9bd64d8 into kubeflow:main Jul 31, 2025
8 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Jul 31, 2025
@andreyvelich andreyvelich deleted the wait-for-job-api branch July 31, 2025 16:54