-
Notifications
You must be signed in to change notification settings - Fork 41
feat: Implement TrainerClient Backends & Local Process #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Saad Zaher <szaher@redhat.com>
7ab03e1 to
3524abc
Compare
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
|
@eoinfennessy: changing LGTM is restricted to collaborators In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @szaher!
I left a few comments.
/cc @Electronic-Waste @astefanutti @saileshd1402 @johnugeorge @akshaychitneni @deepanker13 Appreciate your review as well!
Signed-off-by: Saad Zaher <szaher@redhat.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andreyvelich, I replied so of your comments
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com>
|
@andreyvelich regarding #33 (comment) a similar pattern is used here https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/__init__.py#L653 |
|
@szaher Nice work. The concept of backends and the code and layout is very clean. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @szaher! I left a few comments
|
/milestone v0.1 |
|
@kramaranya: You must be a member of the kubeflow/kubeflow-trainer-team GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Kubeflow Trainer Team and have them propose you as an additional delegate for this responsibility. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/assign @astefanutti @Electronic-Waste @tenzen-y @johnugeorge @saileshd1402 @akshaychitneni @deepanker13 @shravan-achar please review it when you get chance |
While I believe in simplicity and diving this into steps makes it easier for debugging and extensibility. Addressing comments on this PR consolidating all train job scripts into one and running it as single step to match k8s. Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this great feature @szaher!
As we discussed offline, we will address the improvements in the followup PRs after the initial release.
Looking forward to move this forward.
/lgtm
/approve
Signed-off-by: Saad Zaher <szaher@redhat.com>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/lgtm |
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
| - name: Upload coverage to Coveralls | ||
| uses: coverallsapp/github-action@v2 | ||
| continue-on-error: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ignore failure in the coveralls action for now.
I don't think that should block PR merges, and we've also seen outages of that service before.
|
/lgtm |
* Implement TrainerClient Backends & Local Process Signed-off-by: Saad Zaher <szaher@redhat.com> * Implement Job Cancellation Signed-off-by: Saad Zaher <szaher@redhat.com> * update local job to add resouce limitation in k8s style Signed-off-by: Saad Zaher <szaher@redhat.com> * Update python/kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Fix linting issues Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * fix unit tests Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * add support wait_for_job_status Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * Update data types Signed-off-by: Saad Zaher <szaher@redhat.com> * fix merge conflict Signed-off-by: Saad Zaher <szaher@redhat.com> * fix unit tests Signed-off-by: Saad Zaher <szaher@redhat.com> * remove TypeAlias Signed-off-by: Saad Zaher <szaher@redhat.com> * Replace TRAINER_BACKEND_REGISTRY with TRAINER_BACKEND Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Restructure training backends into separate dirs Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * add get_runtime_packages as not supported by local-exec Signed-off-by: Saad Zaher <szaher@redhat.com> * move backends and its configs to kubeflow.trainer Signed-off-by: Saad Zaher <szaher@redhat.com> * fix typo in delete_job Signed-off-by: Saad Zaher <szaher@redhat.com> * Move local_runtimes to constants * Move local_runtimes to constants * allow list_jobs to filter by runtime * keep runtime ref in __local_jobs Signed-off-by: Saad Zaher <szaher@redhat.com> * use google style docstring for LocalJob Signed-off-by: Saad Zaher <szaher@redhat.com> * remove debug opt from LocalProcessConfig Signed-off-by: Saad Zaher <szaher@redhat.com> * only use imports from kubeflow.trainer for backends Signed-off-by: Saad Zaher <szaher@redhat.com> * upload local-exec to use only one step While I believe in simplicity and diving this into steps makes it easier for debugging and extensibility. Addressing comments on this PR consolidating all train job scripts into one and running it as single step to match k8s. Signed-off-by: Saad Zaher <szaher@redhat.com> * optimize loops when getting runtime Signed-off-by: Saad Zaher <szaher@redhat.com> * add LocalRuntimeTrainer Signed-off-by: Saad Zaher <szaher@redhat.com> * rename cleanup config item to cleanup_venv Signed-off-by: Saad Zaher <szaher@redhat.com> * convert local runtime to runtime Signed-off-by: Saad Zaher <szaher@redhat.com> * convert runtimes before returning Signed-off-by: Saad Zaher <szaher@redhat.com> * fix get_job_logs to align with parent interface Signed-off-by: Saad Zaher <szaher@redhat.com> * rename get_runtime_trainer func Signed-off-by: Saad Zaher <szaher@redhat.com> * rename get_training_job_command to get_local_train_job_script Signed-off-by: Saad Zaher <szaher@redhat.com> * Ignore failures in Coveralls action Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Saad Zaher <szaher@redhat.com> Signed-off-by: Saad Zaher <eng.szaher@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Implement TrainerClient Backends & Local Process Signed-off-by: Saad Zaher <szaher@redhat.com> * Implement Job Cancellation Signed-off-by: Saad Zaher <szaher@redhat.com> * update local job to add resouce limitation in k8s style Signed-off-by: Saad Zaher <szaher@redhat.com> * Update python/kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Fix linting issues Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * fix unit tests Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * add support wait_for_job_status Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * Update data types Signed-off-by: Saad Zaher <szaher@redhat.com> * fix merge conflict Signed-off-by: Saad Zaher <szaher@redhat.com> * fix unit tests Signed-off-by: Saad Zaher <szaher@redhat.com> * remove TypeAlias Signed-off-by: Saad Zaher <szaher@redhat.com> * Replace TRAINER_BACKEND_REGISTRY with TRAINER_BACKEND Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Restructure training backends into separate dirs Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * add get_runtime_packages as not supported by local-exec Signed-off-by: Saad Zaher <szaher@redhat.com> * move backends and its configs to kubeflow.trainer Signed-off-by: Saad Zaher <szaher@redhat.com> * fix typo in delete_job Signed-off-by: Saad Zaher <szaher@redhat.com> * Move local_runtimes to constants * Move local_runtimes to constants * allow list_jobs to filter by runtime * keep runtime ref in __local_jobs Signed-off-by: Saad Zaher <szaher@redhat.com> * use google style docstring for LocalJob Signed-off-by: Saad Zaher <szaher@redhat.com> * remove debug opt from LocalProcessConfig Signed-off-by: Saad Zaher <szaher@redhat.com> * only use imports from kubeflow.trainer for backends Signed-off-by: Saad Zaher <szaher@redhat.com> * upload local-exec to use only one step While I believe in simplicity and diving this into steps makes it easier for debugging and extensibility. Addressing comments on this PR consolidating all train job scripts into one and running it as single step to match k8s. Signed-off-by: Saad Zaher <szaher@redhat.com> * optimize loops when getting runtime Signed-off-by: Saad Zaher <szaher@redhat.com> * add LocalRuntimeTrainer Signed-off-by: Saad Zaher <szaher@redhat.com> * rename cleanup config item to cleanup_venv Signed-off-by: Saad Zaher <szaher@redhat.com> * convert local runtime to runtime Signed-off-by: Saad Zaher <szaher@redhat.com> * convert runtimes before returning Signed-off-by: Saad Zaher <szaher@redhat.com> * fix get_job_logs to align with parent interface Signed-off-by: Saad Zaher <szaher@redhat.com> * rename get_runtime_trainer func Signed-off-by: Saad Zaher <szaher@redhat.com> * rename get_training_job_command to get_local_train_job_script Signed-off-by: Saad Zaher <szaher@redhat.com> * Ignore failures in Coveralls action Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Saad Zaher <szaher@redhat.com> Signed-off-by: Saad Zaher <eng.szaher@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Implement TrainerClient Backends & Local Process Signed-off-by: Saad Zaher <szaher@redhat.com> * Implement Job Cancellation Signed-off-by: Saad Zaher <szaher@redhat.com> * update local job to add resouce limitation in k8s style Signed-off-by: Saad Zaher <szaher@redhat.com> * Update python/kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Fix linting issues Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * fix unit tests Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * add support wait_for_job_status Signed-off-by: Saad Zaher <eng.szaher@gmail.com> * Update data types Signed-off-by: Saad Zaher <szaher@redhat.com> * fix merge conflict Signed-off-by: Saad Zaher <szaher@redhat.com> * fix unit tests Signed-off-by: Saad Zaher <szaher@redhat.com> * remove TypeAlias Signed-off-by: Saad Zaher <szaher@redhat.com> * Replace TRAINER_BACKEND_REGISTRY with TRAINER_BACKEND Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * Restructure training backends into separate dirs Signed-off-by: Saad Zaher <szaher@redhat.com> * Update kubeflow/trainer/api/trainer_client.py Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com> * add get_runtime_packages as not supported by local-exec Signed-off-by: Saad Zaher <szaher@redhat.com> * move backends and its configs to kubeflow.trainer Signed-off-by: Saad Zaher <szaher@redhat.com> * fix typo in delete_job Signed-off-by: Saad Zaher <szaher@redhat.com> * Move local_runtimes to constants * Move local_runtimes to constants * allow list_jobs to filter by runtime * keep runtime ref in __local_jobs Signed-off-by: Saad Zaher <szaher@redhat.com> * use google style docstring for LocalJob Signed-off-by: Saad Zaher <szaher@redhat.com> * remove debug opt from LocalProcessConfig Signed-off-by: Saad Zaher <szaher@redhat.com> * only use imports from kubeflow.trainer for backends Signed-off-by: Saad Zaher <szaher@redhat.com> * upload local-exec to use only one step While I believe in simplicity and diving this into steps makes it easier for debugging and extensibility. Addressing comments on this PR consolidating all train job scripts into one and running it as single step to match k8s. Signed-off-by: Saad Zaher <szaher@redhat.com> * optimize loops when getting runtime Signed-off-by: Saad Zaher <szaher@redhat.com> * add LocalRuntimeTrainer Signed-off-by: Saad Zaher <szaher@redhat.com> * rename cleanup config item to cleanup_venv Signed-off-by: Saad Zaher <szaher@redhat.com> * convert local runtime to runtime Signed-off-by: Saad Zaher <szaher@redhat.com> * convert runtimes before returning Signed-off-by: Saad Zaher <szaher@redhat.com> * fix get_job_logs to align with parent interface Signed-off-by: Saad Zaher <szaher@redhat.com> * rename get_runtime_trainer func Signed-off-by: Saad Zaher <szaher@redhat.com> * rename get_training_job_command to get_local_train_job_script Signed-off-by: Saad Zaher <szaher@redhat.com> * Ignore failures in Coveralls action Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Saad Zaher <szaher@redhat.com> Signed-off-by: Saad Zaher <eng.szaher@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
What this PR does / why we need it: This PR introduces
LocalProcessBackenda new execution backend to enable users to run training jobs locally.LocalProcessBackendruns three jobs (deps,train,cleanup)LocalProcessBackendcreates a new virtualenvdepsstep installs depedencies passed by the user in theCustomTrainertrainstep runs the training script but will wait tilldepsis completed then runcleanupstep removes the virtualenv after training has completed.LocalProcessBackendConfigintroduces new configuration itemsLocalProcessBackendConfig.cleanupset to true by default. whether or not to remove the virtualenv after training is completedLocalProcessBackendConfig.debugadds some debugging statements. Default (False)LocalProcessBackendConfig.run_in_venv_dirwhether or not copy and run all execution files from the virtualenv location.torch-distributedas Runtime adjust to run locally.** Future features **:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...format, will close the issue(s) when PR gets merged):Fixes #22
Checklist: