-
Notifications
You must be signed in to change notification settings - Fork 488
Adding out of the box support to TrainJob #2560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this effort @ram4444!
- Please also update RBAC: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/rbac.yaml
- Primary pod labels a described here: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#use-crds-with-trial-template: https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go#L117
- Success and Failure conditions: https://github.com/kubeflow/katib/blob/master/pkg/apis/controller/experiments/v1beta1/experiment_defaults.go#L110
/assign @kubeflow/kubeflow-trainer-team @astefanutti @franciscojavierarceo @helenxie-bit
examples/v1beta1/kubeflow-training-operator/trainjob-pytorch.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this example to examples/v1beta1/kubeflow-trainer/trainjob-pytorch.yaml
# min: 1 | ||
# max: 5 | ||
trialTemplate: | ||
primaryContainerName: pytorch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be node
, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite understand your question
This is copy from pytorchjob-mnist.yaml and I keep it the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The container name for torch-distributed
runtime is node
: https://github.com/kubeflow/trainer/blob/master/manifests/base/runtimes/torch_distributed.yaml#L22
You can read about API description for trialParameters
in this doc: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#configure-trial-template-specification
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have done it. Please state in somewhere in the doc that this field should be referenced to the ClusterTrainingRuntime. Since I thought it is a custom defined name of the container by the user
- name: arg | ||
description: An additional argument for the training model | ||
reference: arg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I will removed it.
Should I create another PR or put everything on top of these 2 commits?
BTW Do you have any docker image for pytorch-deepspeed_train_t5 so that I can create another example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I create another PR or put everything on top of these 2 commits?
You can make the appropriate changes in this PR.
BTW Do you have any docker image for pytorch-deepspeed_train_t5 so that I can create another example?
We don't have docker image, since we create it using the Kubeflow SDK: https://github.com/kubeflow/trainer/blob/master/examples/deepspeed/text-summarization/T5-Fine-Tuning.ipynb
2062958
to
cb319a2
Compare
/ok-to-test once that is in i can merge kubeflow/manifests#3199 |
@ram4444 Please can you update the controller code according to this: #2560 (review) ? |
Hi, Is it simply adding entry to
? |
No, you have to update other places as I mentioned in this comment: #2560 (review) |
Still not get a clear idea of it. Could you explain more? |
Could you read this doc which explains how CRDs within Katib Trial work: https://www.kubeflow.org/docs/components/katib/user-guides/trial-template/#use-crds-with-trial-template ? |
I have go through the code and doc, but I am not understand what is going to be added/changed in the lines (110&117) specified. Am I go to add an else condition to TrainJob (but I am not sure what is going to be the Default Fail/SuccessCondition/PrimaryPodLabels)?
|
Here are the values that we should use for TrainJob: DefaultTrainJobSuccessCondition = "status.conditions.#(type==\"Complete\")#|#(status==\"True\")#"
DefaultTrainJobFailureCondition = "status.conditions.#(type==\"Failed\")#|#(status==\"True\")#"
DefaultTrainJobPrimaryPodLabels = map[string]string{"jobset.sigs.k8s.io/replicatedjob-name": "node"} |
Please consider whether we should to put them in const.go to be consistent. (or I could proceed as mentioned) |
Yes, please add them into constants.go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ram4444 Would you be able to verify that this integration works on your local Kind cluster?
Since we don't have E2Es for TrainJob, it would be nice to verify it.
cc @Electronic-Waste @kubeflow/kubeflow-trainer-team @astefanutti
t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels | ||
} | ||
} | ||
} else if jobKind == "TrainJob" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add TrainJob to the KubeflowJobKinds list as well please ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function will turn to something like
func (e *Experiment) setDefaultTrialTemplate() {
t := e.Spec.TrialTemplate
// Set default values for Job and Kubeflow Training Job if TrialSpec is not nil
if t != nil && t.TrialSource.TrialSpec != nil {
jobKind := t.TrialSource.TrialSpec.GetKind()
if jobKind == consts.JobKindJob {
if t.SuccessCondition == "" {
t.SuccessCondition = DefaultJobSuccessCondition
}
if t.FailureCondition == "" {
t.FailureCondition = DefaultJobFailureCondition
}
} else if _, ok := KubeflowJobKinds[jobKind]; ok {
if t.SuccessCondition == "" {
if jobKind == "TrainJob"
t.SuccessCondition = DefaultTrainJobSuccessCondition
else
t.SuccessCondition = DefaultKubeflowJobSuccessCondition
}
if t.FailureCondition == "" {
if jobKind == "TrainJob"
t.FailureCondition = DefaultTrainJobFailureCondition
else
t.FailureCondition = DefaultKubeflowJobFailureCondition
}
// For Kubeflow Job also set default PrimaryPodLabels
if len(t.PrimaryPodLabels) == 0 {
if jobKind == "TrainJob"
t.PrimaryPodLabels = DefaultTrainJobPrimaryPodLabels
else
t.PrimaryPodLabels = DefaultKubeflowJobPrimaryPodLabels
}
}
}
e.Spec.TrialTemplate = t
}
To test it in my local k8s
since the katib controller is installed by pulling image in kustomize,
please let me know any command to build the image from scratch
images:
- name: ghcr.io/kubeflow/katib/katib-controller
newName: ghcr.io/kubeflow/katib/katib-controller
newTag: v0.18.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can build controller as follows:
docker build . -f cmd/katib-controller/v1beta1/Dockerfile -t <MY_IMAGE>
Hi, I am sorry to inform that due to hardware issue (my 10-years-old homelab which is the only server capable to run the whole Kubeflow is down), I am not able to test it at this moment.😩 I could commit my latest code to my own repo first. Please let me know if it is ok to proceed. Ram |
Sure, no problem, I can try to deploy it from my machine. Please push your latest changes. Btw, you don't need to deploy the entire Kubeflow Platform, you can just deploy Katib + Trainer control plane to verify it. |
Yo can run it with 4 GB or so. just remove what you do not need. See https://github.com/kubeflow/manifests#kubeflow-components-versions |
@andreyvelich @juliusvonkohout I have committed the latest changes to my repo and I have build the image of katib controller to my dockerhub repo cmd of my deployment
The training operator produce the following error log
|
should I replace the image with my own latest build or just use the original ghcr one? |
Yes, please use your image: |
Trainjob
I still cannot find the metrics collector and it is showing the same error log in the katib controller log after the job has finished |
Can you show full log from Katib controller ?
What do you mean ? I can see that TrainJob that you showed uses Katib's Trial, so Experiment should be there. |
I ve just re-run the experiment and here is the log when it starts
when it comes to the end
|
We also should add |
so the constant.go should be updated to
Just rebuild the image and add the label but still got the same result
|
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Found an issue. Katib needs to understand whether desired pod belongs to Trial here: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/inject_webhook.go#L286-L290 Now Katib can optimize HPs on TrainJobs 🎉 $ k get trial -n $NS
NAME TYPE STATUS AGE
torch-distributed-example-2pnvtn72 Running True 22s
torch-distributed-example-4nqzdhvw Running True 21s
torch-distributed-example-9v79g85p Succeeded True 5m2s
torch-distributed-example-ftw6rv7f Running True 28s
torch-distributed-example-jq9jz7mt Succeeded True 5m2s
torch-distributed-example-phsqstg5 Succeeded True 5m2s
$ k get trainjob -n $NS
NAME STATE AGE
torch-distributed-example-2pnvtn72 25s
torch-distributed-example-4nqzdhvw 24s
torch-distributed-example-ftw6rv7f 31s |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this great contributions @ram4444!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich |
/lgtm |
Just pinge me if you have a release to synchronize. |
This looks great! This will be especially useful when we migrate Katib to Kubeflow SDK. Thank you for working on this! |
What this PR does / why we need it:
Adding out of the box support to TrainJob and providing an example
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: