Skip to content

Conversation

@andreyvelich
Copy link
Member

This API is similar to what I've showed at KubeCon London talk: https://youtu.be/Fnb1a5Kaxgo?t=555

It prints list of pre-installed Python packages and GPU devices (if nvidia-smi is available).

/assign @astefanutti @kramaranya @Electronic-Waste

/hold for review

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
# Check the status after event is generated for the TrainJob's Pods.
trainjob = self.get_job(name)
logger.debug(f"TrainJob {name}, status {trainjob.status}")
if polling_interval > timeout:
Copy link
Member Author

@andreyvelich andreyvelich Aug 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti I have to refactor the wait_for_job_status() API to perform polling as before.
The problem that I saw is when Pods are succeeded too fast and TrainJob controller doesn't add the Complete condition to the .status.conditions.

Since we only watch for Pod events, we can't catch this event, and TrainJob is stuck in Running condition.

Alternatively, we can watch both for TrainJob + Pods with two Python threads, but I am not sure if it worths it.

What do you think @astefanutti ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich I agree with you. This problem should be addressed when we'll have comprehensive TrainJob conditions. During the interim, better keep things simple in the SDK and refactor it once we'll have the new TrainJob conditions.

Comment on lines +171 to +172
device: str = constants.UNKNOWN
device_count: str = constants.UNKNOWN
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I make it consistent with Step device and device_count.
I think, it looks better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great. We should also update notebooks in trainer examples

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@coveralls
Copy link

coveralls commented Aug 5, 2025

Pull Request Test Coverage Report for Build 16759681455

Details

  • 24 of 40 (60.0%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-1.3%) to 64.719%

Changes Missing Coverage Covered Lines Changed/Added Lines %
python/kubeflow/trainer/utils/utils.py 4 5 80.0%
python/kubeflow/trainer/api/trainer_client.py 20 35 57.14%
Files with Coverage Reduction New Missed Lines %
python/kubeflow/trainer/api/trainer_client.py 1 69.74%
Totals Coverage Status
Change from base Build 16750742552: -1.3%
Covered Lines: 288
Relevant Lines: 445

💛 - Coveralls

@astefanutti
Copy link
Contributor

/lgtm

Very nice!

@google-oss-prow google-oss-prow bot added the lgtm label Aug 6, 2025
Copy link
Contributor

@kramaranya kramaranya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this!
/lgtm

Comment on lines +171 to +172
device: str = constants.UNKNOWN
device_count: str = constants.UNKNOWN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great. We should also update notebooks in trainer examples

Copy link
Member Author

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich
Copy link
Member Author

/hold cancel

@google-oss-prow google-oss-prow bot merged commit d90dbce into kubeflow:main Aug 6, 2025
9 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Aug 6, 2025
@andreyvelich andreyvelich deleted the get-runtime-packages branch August 6, 2025 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants