feat: Implement TrainerClient Backends & Local Process #33

szaher · 2025-06-21T14:06:02Z

What this PR does / why we need it: This PR introduces the following changes:

Implement LocalProcessBackend, a new execution backend that enables users to run training jobs locally.
LocalProcessBackend runs three steps (deps, train, cleanup).
LocalProcessBackend creates a new virtualenv:
- The deps step installs the dependencies passed by the user in the CustomTrainer.
- The train step runs the training script, but only after the deps step has completed.
- The cleanup step removes the virtualenv after training has completed.
LocalProcessBackendConfig introduces new configuration items:
- LocalProcessBackendConfig.cleanup: whether to remove the virtualenv after training has completed. Default: True.
- LocalProcessBackendConfig.debug: adds some debugging statements. Default: False.
- LocalProcessBackendConfig.run_in_venv_dir: whether to copy and run all execution files from the virtualenv location.
Add torch-distributed as a runtime, adjusted to run locally.
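
For readers who want a concrete picture of the deps, train, and cleanup flow described above, here is a minimal sketch. It is an illustration, not the code in this PR: `LocalConfigSketch` and `run_local_job` are hypothetical names, a standard-library dataclass stands in for the pydantic config, and only the `cleanup` option is modelled.

```python
from __future__ import annotations

import os
import shutil
import subprocess
import tempfile
import venv
from dataclasses import dataclass
from pathlib import Path


@dataclass
class LocalConfigSketch:
    """Hypothetical stand-in for LocalProcessBackendConfig (not the real class)."""

    cleanup: bool = True  # remove the virtualenv once training has finished


def run_local_job(
    script: str, packages: list[str], cfg: LocalConfigSketch
) -> tuple[str, Path]:
    """Run `script` in a fresh virtualenv: deps, then train, then cleanup."""
    env_dir = Path(tempfile.mkdtemp(prefix="local-train-"))
    # Bootstrap pip only when there is something to install.
    venv.create(env_dir, with_pip=bool(packages), symlinks=os.name != "nt")
    bin_dir = env_dir / ("Scripts" if os.name == "nt" else "bin")
    python = str(bin_dir / ("python.exe" if os.name == "nt" else "python"))
    try:
        # deps: install user-supplied packages; check=True aborts the job on failure.
        if packages:
            subprocess.run([python, "-m", "pip", "install", *packages], check=True)
        # train: reached only after the deps step has returned successfully.
        result = subprocess.run(
            [python, "-c", script], check=True, capture_output=True, text=True
        )
        return result.stdout, env_dir
    finally:
        # cleanup: runs whether training succeeded or failed.
        if cfg.cleanup:
            shutil.rmtree(env_dir, ignore_errors=True)
```

Calling `run_local_job("print('trained')", [], LocalConfigSketch())` runs the script inside the throwaway environment and removes it afterwards; passing `LocalConfigSketch(cleanup=False)` leaves the directory in place for inspection.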

**Future features**:

[TBD] Implement dataset initializer step
[TBD] Implement model initializer step

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #22

Checklist:

Docs included if any changes are user facing

Signed-off-by: Saad Zaher <szaher@redhat.com>

python/kubeflow/trainer/api/trainer_client.py

google-oss-prow · 2025-07-08T09:52:32Z

@eoinfennessy: changing LGTM is restricted to collaborators

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

andreyvelich

Thank you for this @szaher!
I left a few comments.
/cc @Electronic-Waste @astefanutti @saileshd1402 @johnugeorge @akshaychitneni @deepanker13 Appreciate your review as well!

python/kubeflow/trainer/api/trainer_client.py

python/kubeflow/trainer/backends/k8s.py

python/kubeflow/trainer/backends/local_process.py

python/kubeflow/trainer/local/job.py

python/kubeflow/trainer/utils/local.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher

Thanks @andreyvelich, I replied to some of your comments.

python/kubeflow/trainer/backends/local_process.py

python/kubeflow/trainer/local/job.py

python/kubeflow/trainer/types/local.py

python/kubeflow/trainer/utils/local.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher · 2025-07-09T18:19:14Z

@andreyvelich regarding #33 (comment)
It acts as a validation layer for the SDK, and it helps the trainer client associate the right backend class with its pydantic configuration.

A similar pattern is used here: https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/llms/__init__.py#L653
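
The registry idea in this comment can be sketched as a mapping from config class to backend class. This is only an illustration under stated assumptions: every name below is hypothetical, plain dataclasses stand in for the pydantic config models, and the real SDK code may be organised differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class KubernetesConfigSketch:
    """Hypothetical config for a Kubernetes backend."""

    namespace: str = "default"


@dataclass
class LocalProcessConfigSketch:
    """Hypothetical config for a local process backend."""

    cleanup: bool = True


class KubernetesBackendSketch:
    def __init__(self, cfg: KubernetesConfigSketch) -> None:
        self.cfg = cfg


class LocalProcessBackendSketch:
    def __init__(self, cfg: LocalProcessConfigSketch) -> None:
        self.cfg = cfg


# Registry: each config type is paired with the backend class that consumes it.
BACKEND_REGISTRY_SKETCH: dict[type, type] = {
    KubernetesConfigSketch: KubernetesBackendSketch,
    LocalProcessConfigSketch: LocalProcessBackendSketch,
}


def resolve_backend(cfg: object) -> object:
    """Validate the config type and build the matching backend."""
    backend_cls = BACKEND_REGISTRY_SKETCH.get(type(cfg))
    if backend_cls is None:
        raise ValueError(f"Unsupported backend config: {type(cfg).__name__}")
    return backend_cls(cfg)
```

With this shape, the client needs only the config object to pick a backend, and an unknown config type fails early with a clear error instead of reaching backend code.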

briangallagher · 2025-07-17T11:28:02Z

@szaher Nice work. The concept of backends, the code, and the layout are very clean.
If we wanted to add support for lightweight kube distributions such as Kind or k3s.io, would they be just a K8s backend or a distinct backend?

kramaranya

Thanks a lot @szaher! I left a few comments

python/kubeflow/trainer/api/trainer_client.py

python/kubeflow/trainer/backends/base.py

python/kubeflow/trainer/backends/k8s.py

python/kubeflow/trainer/api/trainer_client.py

kramaranya · 2025-07-23T15:55:48Z

/milestone v0.1

google-oss-prow · 2025-07-23T15:55:51Z

@kramaranya: You must be a member of the kubeflow/kubeflow-trainer-team GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Kubeflow Trainer Team and have them propose you as an additional delegate for this responsibility.

In response to this:

/milestone v0.1

kramaranya · 2025-08-06T20:41:41Z

/assign @astefanutti @Electronic-Waste @tenzen-y @johnugeorge @saileshd1402 @akshaychitneni @deepanker13 @shravan-achar please review it when you get a chance
/hold for others to review this

kubeflow/trainer/__init__.py

kubeflow/trainer/backends/localprocess/backend.py

kubeflow/trainer/backends/localprocess/types.py

While I believe in simplicity, and dividing this into steps makes debugging and extensibility easier, this commit addresses the comments on this PR by consolidating all train job scripts into one and running it as a single step to match k8s. Signed-off-by: Saad Zaher <szaher@redhat.com>

Signed-off-by: Saad Zaher <szaher@redhat.com>

kubeflow/trainer/backends/localprocess/backend.py

kubeflow/trainer/backends/localprocess/types.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

kubeflow/trainer/backends/localprocess/backend.py

kubeflow/trainer/backends/localprocess/utils.py

kubeflow/trainer/backends/localprocess/backend.py

Signed-off-by: Saad Zaher <szaher@redhat.com>

andreyvelich

Thank you for this great feature @szaher!
As we discussed offline, we will address the improvements in follow-up PRs after the initial release.

Looking forward to moving this forward.
/lgtm
/approve

Signed-off-by: Saad Zaher <szaher@redhat.com>

google-oss-prow · 2025-09-14T17:32:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

andreyvelich · 2025-09-14T17:52:31Z

/lgtm

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

andreyvelich · 2025-09-15T00:43:15Z

.github/workflows/test-python.yaml


      - name: Upload coverage to Coveralls
        uses: coverallsapp/github-action@v2
+        continue-on-error: true


I'm ignoring failures in the Coveralls action for now.
I don't think they should block PR merges, and we've also seen outages of that service before.

andreyvelich · 2025-09-15T00:45:28Z

/lgtm

* Implement TrainerClient Backends & Local Process
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Implement Job Cancellation
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* update local job to add resouce limitation in k8s style
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Update python/kubeflow/trainer/api/trainer_client.py
  Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Fix linting issues
  Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
* fix unit tests
  Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
* add support wait_for_job_status
  Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
* Update data types
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* fix merge conflict
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* fix unit tests
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* remove TypeAlias
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Replace TRAINER_BACKEND_REGISTRY with TRAINER_BACKEND
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Update kubeflow/trainer/api/trainer_client.py
  Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Update kubeflow/trainer/api/trainer_client.py
  Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Restructure training backends into separate dirs
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Update kubeflow/trainer/api/trainer_client.py
  Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* add get_runtime_packages as not supported by local-exec
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* move backends and its configs to kubeflow.trainer
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* fix typo in delete_job
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Move local_runtimes to constants
* Move local_runtimes to constants
* allow list_jobs to filter by runtime
* keep runtime ref in __local_jobs
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* use google style docstring for LocalJob
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* remove debug opt from LocalProcessConfig
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* only use imports from kubeflow.trainer for backends
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* upload local-exec to use only one step
  While I believe in simplicity, and dividing this into steps makes debugging and extensibility easier, this commit addresses the comments on this PR by consolidating all train job scripts into one and running it as a single step to match k8s.
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* optimize loops when getting runtime
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* add LocalRuntimeTrainer
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* rename cleanup config item to cleanup_venv
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* convert local runtime to runtime
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* convert runtimes before returning
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* fix get_job_logs to align with parent interface
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* rename get_runtime_trainer func
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* rename get_training_job_command to get_local_train_job_script
  Signed-off-by: Saad Zaher <szaher@redhat.com>
* Ignore failures in Coveralls action
  Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot added the do-not-merge/work-in-progress label Jun 21, 2025

google-oss-prow bot requested review from andreyvelich, astefanutti and tenzen-y June 21, 2025 14:06

google-oss-prow bot added the size/XXL label Jun 21, 2025

Implement TrainerClient Backends & Local Process

3524abc

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher force-pushed the training-backends branch from 7ab03e1 to 3524abc Compare June 21, 2025 14:07

szaher added 2 commits June 21, 2025 17:47

Merge branch 'main' of github.com:kubeflow/sdk into training-backends

0f4e504

Implement Job Cancellation

908af68

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher changed the title ~~[WIP] Implement TrainerClient Backends & Local Process~~ Implement TrainerClient Backends & Local Process Jun 25, 2025

google-oss-prow bot removed the do-not-merge/work-in-progress label Jun 25, 2025

Merge branch 'main' into training-backends

71e83ae

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher mentioned this pull request Jul 3, 2025

feat: KEP-2 Local Execution Mode Proposal #34

Merged

1 task

eoinfennessy reviewed Jul 7, 2025

python/kubeflow/trainer/api/trainer_client.py (outdated, resolved)

eoinfennessy suggested changes Jul 8, 2025

python/kubeflow/trainer/api/trainer_client.py (outdated, resolved)

python/kubeflow/trainer/api/trainer_client.py (outdated, resolved)

andreyvelich reviewed Jul 8, 2025

update local job to add resouce limitation in k8s style

3d578c7

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher commented Jul 9, 2025

This comment was marked as duplicate.

Update python/kubeflow/trainer/api/trainer_client.py

bed8f70

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Saad Zaher <szaher@redhat.com>

kramaranya reviewed Jul 23, 2025

kramaranya mentioned this pull request Jul 23, 2025

[Release] Kubeflow SDK 0.1 Release #45

Closed

12 tasks

franciscojavierarceo requested review from franciscojavierarceo and mprahl July 24, 2025 14:40

andreyvelich reviewed Sep 11, 2025

szaher added 4 commits September 13, 2025 13:09

optimize loops when getting runtime

74d60a4

Signed-off-by: Saad Zaher <szaher@redhat.com>

add LocalRuntimeTrainer

9d9a14c

Signed-off-by: Saad Zaher <szaher@redhat.com>

rename cleanup config item to cleanup_venv

60d96d0

Signed-off-by: Saad Zaher <szaher@redhat.com>

andreyvelich reviewed Sep 14, 2025

kubeflow/trainer/backends/localprocess/backend.py (outdated, resolved)

kubeflow/trainer/backends/localprocess/backend.py (outdated, resolved)

kubeflow/trainer/backends/localprocess/types.py (outdated, resolved)

convert local runtime to runtime

8e9190e

Signed-off-by: Saad Zaher <szaher@redhat.com>

andreyvelich reviewed Sep 14, 2025

szaher added 3 commits September 14, 2025 16:39

convert runtimes before returning

ac0be0c

Signed-off-by: Saad Zaher <szaher@redhat.com>

fix get_job_logs to align with parent interface

4b1db8c

Signed-off-by: Saad Zaher <szaher@redhat.com>

rename get_runtime_trainer func

4fe7baa

Signed-off-by: Saad Zaher <szaher@redhat.com>

andreyvelich reviewed Sep 14, 2025

google-oss-prow bot assigned andreyvelich Sep 14, 2025

google-oss-prow bot added the lgtm label Sep 14, 2025

rename get_training_job_command to get_local_train_job_script

9775f3f

Signed-off-by: Saad Zaher <szaher@redhat.com>

google-oss-prow bot added approved and removed lgtm labels Sep 14, 2025

google-oss-prow bot added the lgtm label Sep 14, 2025

Ignore failures in Coveralls action

42c7769

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

google-oss-prow bot removed the lgtm label Sep 15, 2025

andreyvelich reviewed Sep 15, 2025

google-oss-prow bot added the lgtm label Sep 15, 2025

google-oss-prow bot merged commit ffc3d62 into kubeflow:main Sep 15, 2025
10 checks passed

andreyvelich mentioned this pull request Sep 15, 2025

feat(trainer): Introduce LocalTrainerClient #13

Closed

1 task
