feat(trainer): Add `get_runtime_packages()` API #57

andreyvelich · 2025-08-05T18:12:02Z

This API is similar to what I showed in my KubeCon London talk: https://youtu.be/Fnb1a5Kaxgo?t=555

It prints the list of pre-installed Python packages and GPU devices (if `nvidia-smi` is available).
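
For readers unfamiliar with the idea, here is a minimal sketch of the kind of script that could run inside a runtime's container to collect this information. It is not the SDK's implementation: the helper name `describe_environment` and its return shape are assumptions made for illustration only.

```python
import shutil
import subprocess
import sys
from importlib import metadata


def describe_environment() -> dict:
    """Collect the Python version, installed packages, and GPU info (hypothetical helper)."""
    packages = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions without a usable name
    )

    gpus = None
    # Only query GPUs when the nvidia-smi binary is on PATH.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
        )
        if result.returncode == 0:
            gpus = result.stdout.strip()

    return {"python": sys.version.split()[0], "packages": packages, "gpus": gpus}


if __name__ == "__main__":
    info = describe_environment()
    print(f"Python: {info['python']}")
    print("\n".join(info["packages"]))
    print(info["gpus"] or "nvidia-smi not available")
```

Checking for the binary with `shutil.which` before calling it keeps the script usable on CPU-only runtimes.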

/assign @astefanutti @kramaranya @Electronic-Waste

/hold for review

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich · 2025-08-05T18:16:08Z

python/kubeflow/trainer/api/trainer_client.py

-                # Check the status after event is generated for the TrainJob's Pods.
-                trainjob = self.get_job(name)
-                logger.debug(f"TrainJob {name}, status {trainjob.status}")
+        if polling_interval > timeout:


@astefanutti I had to refactor the `wait_for_job_status()` API to poll as it did before.
The problem I saw is that when Pods succeed too quickly, the TrainJob controller doesn't add the Complete condition to `.status.conditions`.

Since we only watch for Pod events, we can't catch this event, and the TrainJob is stuck in the Running condition.

Alternatively, we can watch both the TrainJob and its Pods with two Python threads, but I am not sure it is worth it.

What do you think, @astefanutti?

@andreyvelich I agree with you. This problem should be addressed once we have comprehensive TrainJob conditions. In the interim, it is better to keep things simple in the SDK and refactor it once the new TrainJob conditions are available.
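
The polling approach discussed in this thread can be sketched roughly as follows. This is an illustration, not the code in `trainer_client.py`: the helper name `wait_for_status`, the `get_status` callable, the default values, and the choice to raise `ValueError` when the interval exceeds the timeout are all assumptions.

```python
import time
from typing import Callable, Set


def wait_for_status(
    get_status: Callable[[], str],
    expected: Set[str],
    timeout: float = 600,
    polling_interval: float = 2,
) -> str:
    """Poll get_status() until it returns one of the expected values (hypothetical helper)."""
    # Assumed validation: an interval longer than the timeout would never poll twice.
    if polling_interval > timeout:
        raise ValueError(
            f"Polling interval {polling_interval} must not exceed timeout {timeout}"
        )

    deadline = time.monotonic() + timeout
    while True:
        # Re-read the status on every iteration instead of relying on Pod events,
        # so a job that finishes quickly is still observed.
        status = get_status()
        if status in expected:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Timed out waiting for status in {sorted(expected)}")
        time.sleep(polling_interval)
```

Because each iteration reads the current status directly, the loop does not depend on any single event being delivered, which is the failure mode described above.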

andreyvelich · 2025-08-05T18:29:50Z

python/kubeflow/trainer/types/types.py

+    device: str = constants.UNKNOWN
+    device_count: str = constants.UNKNOWN


I made it consistent with the Step `device` and `device_count` fields.
I think it looks better.

Sounds great. We should also update the notebooks in the trainer examples.
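
To illustrate the consistency being discussed, here is a small sketch of two dataclasses that share the same `device` and `device_count` fields with a string sentinel as the default. The class names `ExampleTrainer` and `ExampleStep` and the value of `UNKNOWN` are assumptions; the real definitions live in `types.py` and `constants` and are not shown in this thread.

```python
from dataclasses import dataclass

# Assumed stand-in for constants.UNKNOWN; the real value is not shown in this PR thread.
UNKNOWN = "Unknown"


@dataclass
class ExampleTrainer:
    """Hypothetical type with the two fields from the diff above."""

    device: str = UNKNOWN
    device_count: str = UNKNOWN


@dataclass
class ExampleStep:
    """Hypothetical step type that uses the same field names and defaults."""

    name: str
    device: str = UNKNOWN
    device_count: str = UNKNOWN


trainer = ExampleTrainer()
step = ExampleStep(name="node-0", device="gpu", device_count="2")
```

Keeping `device_count` as a string lets both types fall back to the same sentinel when the count cannot be determined.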

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

coveralls · 2025-08-05T19:45:18Z

Pull Request Test Coverage Report for Build 16759681455

Details

24 of 40 (60.0%) changed or added relevant lines in 2 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage decreased (-1.3%) to 64.719%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| python/kubeflow/trainer/utils/utils.py | 4 | 5 | 80.0% |
| python/kubeflow/trainer/api/trainer_client.py | 20 | 35 | 57.14% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| python/kubeflow/trainer/api/trainer_client.py | 1 | 69.74% |

Totals
Change from base Build 16750742552:	-1.3%
Covered Lines:	288
Relevant Lines:	445

💛 - Coveralls

astefanutti · 2025-08-06T07:58:24Z

/lgtm

Very nice!

kramaranya

Thank you for this!
/lgtm

kramaranya · 2025-08-06T08:20:31Z

python/kubeflow/trainer/types/types.py

+    device: str = constants.UNKNOWN
+    device_count: str = constants.UNKNOWN


Sounds great. We should also update the notebooks in the trainer examples.

andreyvelich

Thanks!
/approve

google-oss-prow · 2025-08-06T16:27:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2025-08-06T16:29:32Z

/hold cancel

andreyvelich added 3 commits August 5, 2025 18:21

feat(trainer): Add get_runtime_packages() API

2788852

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Fix mpirun command

02c86d0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Print Python version

b435e07

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot assigned astefanutti Aug 5, 2025

google-oss-prow bot added the do-not-merge/hold label Aug 5, 2025

google-oss-prow bot assigned Electronic-Waste and kramaranya Aug 5, 2025

google-oss-prow bot requested review from Electronic-Waste and astefanutti August 5, 2025 18:12

google-oss-prow bot added the size/L label Aug 5, 2025

andreyvelich commented Aug 5, 2025

andreyvelich added 2 commits August 5, 2025 20:41

Add unit tests

91a971e

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

Fix verify

08fd0c1

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot added the lgtm label Aug 6, 2025

kramaranya reviewed Aug 6, 2025

andreyvelich commented Aug 6, 2025

google-oss-prow bot added the approved label Aug 6, 2025

google-oss-prow bot removed the do-not-merge/hold label Aug 6, 2025

google-oss-prow bot merged commit d90dbce into kubeflow:main Aug 6, 2025
9 checks passed

google-oss-prow bot added this to the v0.1 milestone Aug 6, 2025

andreyvelich deleted the get-runtime-packages branch August 6, 2025 16:37

andreyvelich mentioned this pull request Aug 7, 2025

chore(runtimes): Update packages in DeepSpeed runtime and fix T5 example kubeflow/trainer#2781

Merged
