-
Notifications
You must be signed in to change notification settings - Fork 834
Add volume and volume mounts arguments to TrainingClient.create_job API #2449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How we can make it easier to configure from the ML engineer perspective ?
I know that @truc0 has implemented it in a way that we can accept PVC, ConfigMap, or Secret as volume for Trial: kubeflow/katib#2508. But it still uses Kubernetes APIs.
@hbelmiro @rimolive @HumairAK Do we know if we have some sort of Kubernetes volume support in KFP V2 ?
I noticed that it is not intended to be implemented in V2: kubeflow/pipelines#8570
In that case, how users can attach volumes to their component in a workflow ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 @HumairAK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In KFP users can access underlying k8s pod spec apis like mounting secrets/volumes/configmaps etc. (example for volumes here)
This exposes k8s apis to the KFP sdk, so it begs the question how friendly this is to the ML engineer persona, one who is not so k8s aware.
I'm not caught up on what use case this PR is looking to resolve, but in KFP we have a specific case of data passing between pipeline jobs, the storage of this data is abstracted via object store, but we see that users sometimes want to use their PVC for this instead without having to manually configure these PVC's in the pipeline. The end result being, you just declare what inputs/outputs you want in your python pipeline sdk, and under the hood we automatically utilize the PVC that you provided when you deployed KFP. This is a feature we'd like to introduce to KFP in the future, you can track it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use case if for any distributed training job to be able to persist checkpoints. For example, this is what the InstructLab fine-tuning step does, and that can only be done via the PyTorchJob API at the moment, not via the SDK.
With this PR, it'll also be possible to do it with the SDK.
Stated differently, without a way to pass a PVC to the PyTorchJob via the SDK, the SDK is rather useless for any real distributed training jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @astefanutti, I just think whether we should give MLE control to all Volume and VolumeMount APIs.
Maybe specifying the volumeName is Sufficient enough.
Maybe we can improve it in V2 SDK given the separation between TrainJob and TrainingRuntime.