feat(trainer): Add get_runtime_packages() API
#57
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
# Check the status after event is generated for the TrainJob's Pods.
trainjob = self.get_job(name)
logger.debug(f"TrainJob {name}, status {trainjob.status}")
if polling_interval > timeout:
@astefanutti I had to refactor the wait_for_job_status() API to perform polling as before.
The problem I saw occurs when Pods succeed too quickly and the TrainJob controller doesn't add the Complete condition to .status.conditions.
Since we only watch for Pod events, we can't catch this event, and the TrainJob gets stuck in the Running condition.
Alternatively, we could watch both the TrainJob and its Pods with two Python threads, but I am not sure it is worth it.
What do you think, @astefanutti?
@andreyvelich I agree with you. This problem should be addressed once we have comprehensive TrainJob conditions. In the interim, it is better to keep things simple in the SDK and refactor it once the new TrainJob conditions are available.
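For readers following this thread, the polling approach discussed above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual implementation: `wait_for_status`, `get_status`, and `expected` are hypothetical names, and raising `ValueError` when `polling_interval > timeout` is an assumption about what the truncated check in the diff does.

```python
import time
from typing import Callable, Set


def wait_for_status(
    get_status: Callable[[], str],
    expected: Set[str],
    timeout: float = 600,
    polling_interval: float = 2,
) -> str:
    """Poll get_status() until it returns one of the expected statuses.

    Hypothetical helper: get_status stands in for a call that re-reads
    the job and returns its current status string.
    """
    # Assumption: a polling interval longer than the timeout is rejected.
    if polling_interval > timeout:
        raise ValueError("polling_interval must not exceed timeout")

    deadline = time.monotonic() + timeout
    while True:
        # Re-read the status on every iteration, so a condition that was
        # set between two checks is still observed on the next poll.
        status = get_status()
        if status in expected:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status {status!r} not in {sorted(expected)}")
        time.sleep(polling_interval)
```

Because the status is re-read on each iteration instead of being derived from a stream of Pod events, a job that finishes very quickly is still picked up on the next poll, at the cost of up to one polling interval of latency.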
device: str = constants.UNKNOWN
device_count: str = constants.UNKNOWN
I made it consistent with the Step device and device_count fields.
I think it looks better.
Sounds great. We should also update the notebooks in the trainer examples.
/lgtm Very nice!
Thank you for this!
/lgtm
Thanks!
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
/hold cancel
This API is similar to what I showed in my KubeCon London talk: https://youtu.be/Fnb1a5Kaxgo?t=555
It prints the list of pre-installed Python packages and GPU devices (if `nvidia-smi` is available).
/assign @astefanutti @kramaranya @Electronic-Waste
/hold for review
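
To illustrate the kind of information described above, here is a rough local sketch that collects installed Python packages and, if `nvidia-smi` is on the PATH, the GPU list. It is not the code in this PR: `list_installed_packages` and `gpu_info` are hypothetical helper names, and the sketch inspects the current Python environment rather than a training runtime.

```python
import shutil
import subprocess
from importlib import metadata
from typing import Dict, Optional


def list_installed_packages() -> Dict[str, str]:
    """Return a {name: version} mapping for the current environment."""
    packages: Dict[str, str] = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return packages


def gpu_info() -> Optional[str]:
    """Return the output of `nvidia-smi -L`, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    for name, version in sorted(list_installed_packages().items()):
        print(f"{name}=={version}")
    gpus = gpu_info()
    print(gpus if gpus else "nvidia-smi not available")
```

Checking for `nvidia-smi` with `shutil.which` before invoking it keeps the sketch usable on CPU-only machines, mirroring the "if available" behaviour described above.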