@szaher
Member

@szaher szaher commented Jun 21, 2025

What this PR does / why we need it: This PR introduces the following:

  • Implement LocalProcessBackend, a new execution backend that lets users run training jobs locally.
  • LocalProcessBackend runs three jobs (deps, train, cleanup).
  • LocalProcessBackend creates a new virtualenv:
    • The deps step installs the dependencies passed by the user in the CustomTrainer.
    • The train step runs the training script, but only after the deps step has completed.
    • The cleanup step removes the virtualenv after training has completed.
  • LocalProcessBackendConfig introduces new configuration items:
    • LocalProcessBackendConfig.cleanup: whether to remove the virtualenv after training has completed. Default: True.
    • LocalProcessBackendConfig.debug: adds some debugging statements. Default: False.
    • LocalProcessBackendConfig.run_in_venv_dir: whether to copy all execution files to the virtualenv location and run them from there.
  • Add torch-distributed as a runtime, adjusted to run locally.
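
For readers following along, the deps / train / cleanup flow described above can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library; it is not the code in this PR, and `run_in_fresh_venv` and its parameters are hypothetical names chosen for the example.

```python
import os
import shutil
import subprocess
import tempfile
import venv
from pathlib import Path


def run_in_fresh_venv(script: str, packages: list[str], cleanup: bool = True) -> str:
    """Create a virtualenv, install packages, run a script, then optionally clean up.

    Hypothetical helper for illustration only; not part of the Kubeflow SDK.
    """
    venv_dir = Path(tempfile.mkdtemp(prefix="local-train-"))
    try:
        # Create the virtualenv. pip is bootstrapped only when packages are requested.
        venv.create(venv_dir, with_pip=bool(packages), symlinks=(os.name != "nt"))
        bin_dir = venv_dir / ("Scripts" if os.name == "nt" else "bin")
        python = bin_dir / ("python.exe" if os.name == "nt" else "python")

        # deps: install the user-supplied dependencies into the virtualenv.
        if packages:
            subprocess.run(
                [str(python), "-m", "pip", "install", *packages], check=True
            )

        # train: starts only after the deps step has finished successfully.
        script_path = venv_dir / "train.py"
        script_path.write_text(script)
        result = subprocess.run(
            [str(python), str(script_path)],
            check=True,
            capture_output=True,
            text=True,
        )
        return result.stdout
    finally:
        # cleanup: remove the virtualenv whether or not training succeeded.
        if cleanup:
            shutil.rmtree(venv_dir, ignore_errors=True)


if __name__ == "__main__":
    print(run_in_fresh_venv("print('trained')", packages=[]))
```

Putting the cleanup in a `finally` block is one possible design choice: the virtualenv is removed even when the training script fails, which may or may not match what the backend in this PR does.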

**Future features**:

  • [TBD] Implement dataset initializer step
  • [TBD] Implement model initializer step

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #22

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Saad Zaher <szaher@redhat.com>
@szaher szaher force-pushed the training-backends branch from 7ab03e1 to 3524abc Compare June 21, 2025 14:07
@szaher szaher changed the title [WIP] Implement TrainerClient Backends & Local Process Implement TrainerClient Backends & Local Process Jun 25, 2025
Signed-off-by: Saad Zaher <szaher@redhat.com>
@google-oss-prow

@eoinfennessy: changing LGTM is restricted to collaborators

Member

@andreyvelich andreyvelich left a comment

Thank you for this @szaher!
I left a few comments.
/cc @Electronic-Waste @astefanutti @saileshd1402 @johnugeorge @akshaychitneni @deepanker13 Appreciate your review as well!

Signed-off-by: Saad Zaher <szaher@redhat.com>
Member Author

@szaher szaher left a comment

Thanks @andreyvelich, I replied to some of your comments.

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
@szaher
Member Author

szaher commented Jul 9, 2025

@andreyvelich regarding #33 (comment)
It acts as a validation layer for the SDK, and it helps the trainer client associate the right backend class with its pydantic configuration.

A similar pattern is used here: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/__init__.py#L653
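
To make the idea concrete, here is a rough sketch of a registry that maps each configuration type to its backend class. It is an assumption-laden illustration, not the SDK's code: the class names, the `BACKEND_REGISTRY` dict, and `make_backend` are hypothetical, and plain dataclasses stand in for the pydantic models so the snippet runs with the standard library alone.

```python
from dataclasses import dataclass


@dataclass
class LocalProcessConfig:
    """Hypothetical stand-in for the local backend's pydantic config."""
    cleanup_venv: bool = True


@dataclass
class KubernetesConfig:
    """Hypothetical stand-in for the Kubernetes backend's pydantic config."""
    namespace: str = "default"


class LocalProcessBackend:
    def __init__(self, cfg: LocalProcessConfig):
        self.cfg = cfg


class KubernetesBackend:
    def __init__(self, cfg: KubernetesConfig):
        self.cfg = cfg


# The registry doubles as a validation layer: only config types listed here
# are accepted, and each one resolves to exactly one backend class.
BACKEND_REGISTRY = {
    LocalProcessConfig: LocalProcessBackend,
    KubernetesConfig: KubernetesBackend,
}


def make_backend(cfg):
    """Return the backend instance associated with the given config object."""
    try:
        backend_cls = BACKEND_REGISTRY[type(cfg)]
    except KeyError:
        raise ValueError(
            f"Unsupported backend config: {type(cfg).__name__}"
        ) from None
    return backend_cls(cfg)


if __name__ == "__main__":
    backend = make_backend(LocalProcessConfig(cleanup_venv=False))
    print(type(backend).__name__)
```

With this shape, a client only needs the config object to pick a backend, and an unknown config type fails early with a clear error.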

@briangallagher
Contributor

@szaher Nice work. The backend concept, the code, and the layout are all very clean.
If we wanted to add support for lightweight Kubernetes distributions such as Kind or k3s.io, would they just use the K8s backend, or would they need a distinct backend?

Contributor

@kramaranya kramaranya left a comment

Thanks a lot @szaher! I left a few comments

@kramaranya
Contributor

/milestone v0.1

@google-oss-prow

@kramaranya: You must be a member of the kubeflow/kubeflow-trainer-team GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Kubeflow Trainer Team and have them propose you as an additional delegate for this responsibility.

@kramaranya
Contributor

/assign @astefanutti @Electronic-Waste @tenzen-y @johnugeorge @saileshd1402 @akshaychitneni @deepanker13 @shravan-achar please review it when you get a chance
/hold for others to review this

While I believe in simplicity, and that dividing this into steps makes
debugging and extensibility easier, this commit addresses comments on
this PR by consolidating all train job scripts into one and running it
as a single step to match k8s.

Signed-off-by: Saad Zaher <szaher@redhat.com>
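
As a rough illustration of what "one script, one step" could look like, the sketch below builds a single shell script that creates the virtualenv, installs dependencies, and runs training in sequence. It is not the script generated by this PR; `build_single_step_script` and its parameters are hypothetical, and it assumes a POSIX shell with `python3` on the PATH.

```python
import shlex


def build_single_step_script(venv_dir, packages, train_args, cleanup_venv=True):
    """Return one shell script covering the deps, train, and cleanup phases.

    Hypothetical helper for illustration only.
    """
    lines = [
        "set -e",  # abort on the first failing command
        f"VENV_DIR={shlex.quote(venv_dir)}",
    ]
    if cleanup_venv:
        # Remove the virtualenv when the script exits, even on failure.
        lines.append("trap 'rm -rf \"$VENV_DIR\"' EXIT")
    lines.append('python3 -m venv "$VENV_DIR"')
    if packages:
        pkgs = " ".join(shlex.quote(p) for p in packages)
        lines.append(f'"$VENV_DIR/bin/pip" install {pkgs}')
    # Training runs last, so it only starts once dependencies are installed.
    lines.append(f'"$VENV_DIR/bin/python" {shlex.join(train_args)}')
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_single_step_script("/tmp/venv", ["torch"], ["train.py", "--epochs", "1"]))
```

Because everything lives in one script, the whole job can be launched as a single process, which is closer to how a single container command runs on Kubernetes; the trade-off is that individual phases are harder to inspect separately.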
Member

@andreyvelich andreyvelich left a comment

Thank you for this great feature @szaher!
As we discussed offline, we will address the improvements in the followup PRs after the initial release.

Looking forward to moving this forward.
/lgtm
/approve

Signed-off-by: Saad Zaher <szaher@redhat.com>
@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

@google-oss-prow google-oss-prow bot added approved and removed lgtm labels Sep 14, 2025
@andreyvelich
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Sep 14, 2025
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Sep 15, 2025
- name: Upload coverage to Coveralls
  uses: coverallsapp/github-action@v2
  continue-on-error: true
Member

@andreyvelich andreyvelich Sep 15, 2025

I'm ignoring failures in the Coveralls action for now.
I don't think they should block PR merges, and we've also seen outages of that service before.

@andreyvelich
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Sep 15, 2025
@google-oss-prow google-oss-prow bot merged commit ffc3d62 into kubeflow:main Sep 15, 2025
10 checks passed
accorvin pushed a commit to opendatahub-io/kubeflow-sdk that referenced this pull request Oct 8, 2025
* Implement TrainerClient Backends & Local Process

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Implement Job Cancellation

Signed-off-by: Saad Zaher <szaher@redhat.com>

* update local job to add resource limitation in k8s style

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update python/kubeflow/trainer/api/trainer_client.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>

* Fix linting issues

Signed-off-by: Saad Zaher <eng.szaher@gmail.com>

* fix unit tests

Signed-off-by: Saad Zaher <eng.szaher@gmail.com>

* add support wait_for_job_status

Signed-off-by: Saad Zaher <eng.szaher@gmail.com>

* Update data types

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix merge conflict

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix unit tests

Signed-off-by: Saad Zaher <szaher@redhat.com>

* remove TypeAlias

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Replace TRAINER_BACKEND_REGISTRY with TRAINER_BACKEND

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update kubeflow/trainer/api/trainer_client.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update kubeflow/trainer/api/trainer_client.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>

* Restructure training backends into separate dirs

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Update kubeflow/trainer/api/trainer_client.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>

* add get_runtime_packages as not supported by local-exec

Signed-off-by: Saad Zaher <szaher@redhat.com>

* move backends and its configs to kubeflow.trainer

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix typo in delete_job

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Move local_runtimes to constants

 * Move local_runtimes to constants
 * allow list_jobs to filter by runtime
 * keep runtime ref in __local_jobs

Signed-off-by: Saad Zaher <szaher@redhat.com>

* use google style docstring for LocalJob

Signed-off-by: Saad Zaher <szaher@redhat.com>

* remove debug opt from LocalProcessConfig

Signed-off-by: Saad Zaher <szaher@redhat.com>

* only use imports from kubeflow.trainer for backends

Signed-off-by: Saad Zaher <szaher@redhat.com>

* upload local-exec to use only one step

While I believe in simplicity, and that dividing this into steps makes
debugging and extensibility easier, this commit addresses comments on
this PR by consolidating all train job scripts into one and running it
as a single step to match k8s.

Signed-off-by: Saad Zaher <szaher@redhat.com>

* optimize loops when getting runtime

Signed-off-by: Saad Zaher <szaher@redhat.com>

* add LocalRuntimeTrainer

Signed-off-by: Saad Zaher <szaher@redhat.com>

* rename cleanup config item to cleanup_venv

Signed-off-by: Saad Zaher <szaher@redhat.com>

* convert local runtime to runtime

Signed-off-by: Saad Zaher <szaher@redhat.com>

* convert runtimes before returning

Signed-off-by: Saad Zaher <szaher@redhat.com>

* fix get_job_logs to align with parent interface

Signed-off-by: Saad Zaher <szaher@redhat.com>

* rename get_runtime_trainer func

Signed-off-by: Saad Zaher <szaher@redhat.com>

* rename get_training_job_command to get_local_train_job_script

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Ignore failures in Coveralls action

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
MStokluska pushed a commit to opendatahub-io/kubeflow-sdk that referenced this pull request Oct 15, 2025

Development

Successfully merging this pull request may close these issues.

Support Local Execution of Training Jobs

9 participants