feat: KEP-2 Local Execution Mode Proposal #34

szaher · 2025-06-21T17:09:26Z

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

Docs included if any changes are user facing

Signed-off-by: Saad Zaher <szaher@redhat.com>

andreyvelich

Sorry for the late review @szaher, I left a few comments.
/assign @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @akshaychitneni @shravan-achar

proposals/2-trainer-local-execution/README.md

* add apple containers
* fix typo in Subprocess
* add API consistency to the design details

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher · 2025-06-30T14:39:40Z

Thanks for your review, @andreyvelich. I updated my branch, fixed some of the issues, and answered your questions.

kramaranya

Thank you @szaher!

proposals/2-trainer-local-execution/README.md

Signed-off-by: Saad Zaher <szaher@redhat.com>

kramaranya · 2025-08-22T14:26:59Z

@szaher could you please move this to docs/proposals?

Signed-off-by: Saad Zaher <szaher@redhat.com>

andreyvelich · 2025-08-22T18:32:50Z

docs/proposals/2-trainer-local-execution/README.md

+
+The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
+
+- Different training backends will need to implement the same interface from the `TrainingBackend` abstract class so `TrainerClient` can initialize and load the backend.


@szaher @astefanutti @kramaranya Shall we call the abstract class RunnerBackend, or just Runner, to be consistent with KFP: https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/#runner-types ?

This is very specific to the trainer, not generic. Runner or RunnerBackend makes sense in the pipelines case, but for the trainer I believe TrainingBackend makes more sense.

Once we migrate other types of jobs, such as OptimizeJob for hyperparameter tuning, we might want to introduce various backends for them as well, right? I am trying to find a name that works for all types of jobs, so users can quickly understand that they will use the local backend for their training jobs, optimization jobs, and ML pipelines.

Alternatively, we can call it ExecutionBackend.

I have renamed it to ExecutionBackend for now. I believe at some point we might want to move the common local execution code to the root package, i.e.:

from kubeflow.local import ExecutionBackend

ExecutionBackend makes sense to me
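As a rough illustration of the interface discussed in this thread, here is a minimal sketch of an `ExecutionBackend` abstract class with one local backend and a client that delegates to it. The method names (`train`, `get_job_logs`), their signatures, and the job naming scheme are assumptions made for this example; they are not taken from the proposal or from the SDK.

```python
import subprocess
import sys
from abc import ABC, abstractmethod


class ExecutionBackend(ABC):
    """Common interface every backend implements (method names are hypothetical)."""

    @abstractmethod
    def train(self, command: list[str]) -> str:
        """Start a training job and return its name."""

    @abstractmethod
    def get_job_logs(self, name: str) -> str:
        """Return the captured logs of a finished job."""


class LocalProcessBackend(ExecutionBackend):
    """Runs the training command as a subprocess on the local machine."""

    def __init__(self) -> None:
        self._logs: dict[str, str] = {}

    def train(self, command: list[str]) -> str:
        # Hypothetical naming scheme: jobs are numbered in submission order.
        name = f"local-job-{len(self._logs)}"
        # Blocks until the process exits; raises CalledProcessError on failure.
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        self._logs[name] = result.stdout
        return name

    def get_job_logs(self, name: str) -> str:
        return self._logs[name]


class TrainerClient:
    """Delegates every call to whichever backend it was constructed with."""

    def __init__(self, backend: ExecutionBackend) -> None:
        self.backend = backend

    def train(self, command: list[str]) -> str:
        return self.backend.train(command)

    def get_job_logs(self, name: str) -> str:
        return self.backend.get_job_logs(name)


# Usage: swap in another ExecutionBackend subclass without touching the client.
client = TrainerClient(backend=LocalProcessBackend())
job = client.train([sys.executable, "-c", "print('epoch 1 done')"])
logs = client.get_job_logs(job)
```

Because the client only depends on the abstract class, a Kubernetes, Podman, or Docker backend could be passed in the same way, which is the naming concern raised above for other job types.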

Signed-off-by: Saad Zaher <eng.szaher@gmail.com>

andreyvelich

Thank you for the updates @szaher!
/lgtm
/assign @kramaranya @astefanutti @Electronic-Waste

kramaranya

Thank you, @szaher!
I left a few nits

kramaranya · 2025-08-27T10:18:09Z

docs/proposals/2-trainer-local-execution/README.md

+
+## Summary
+
+This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.


Suggested change

-This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
+This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing AI Practitioners to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.

kramaranya · 2025-08-27T10:18:36Z

docs/proposals/2-trainer-local-execution/README.md

+## Summary
+
+This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
+The feature will enable ML engineers to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.


Suggested change

-The feature will enable ML engineers to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.
+The feature will enable AI Practitioners to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.

kramaranya · 2025-08-27T10:31:13Z

docs/proposals/2-trainer-local-execution/README.md

+
+Currently, Kubeflow’s Trainer SDK requires jobs to be executed on a Kubernetes cluster.
+This setup can incur significant costs and time delays, especially for model experiments that are in the early stages.
+ML engineers often want to experiment locally before scaling their models to a full cloud-based infrastructure.


Suggested change

-ML engineers often want to experiment locally before scaling their models to a full cloud-based infrastructure.
+AI Practitioners often want to experiment locally before scaling their models to a full cloud-based infrastructure.

kramaranya · 2025-08-27T10:48:21Z

docs/proposals/2-trainer-local-execution/README.md

+### Goals
+- Allow users to run training jobs on their local machines using container runtimes or subprocess.
+- Rework current Kubeflow Trainer SDK to implement Execution Backends with Kubernetes Backend as default.
+- Implement Local Execution/Training Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.


I think we could just keep Execution Backends for now

Suggested change

-- Implement Local Execution/Training Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
+- Implement Local Execution Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.

kramaranya · 2025-08-27T11:26:02Z

docs/proposals/2-trainer-local-execution/high-level-arch.svg

There's a typo in the Start element.

kramaranya · 2025-08-27T11:29:53Z

docs/proposals/2-trainer-local-execution/README.md

+### User Stories (Optional)
+
+#### Story 1
+As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.


Suggested change

-As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
+As an AI Practitioner, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.

kramaranya · 2025-08-27T11:30:12Z

docs/proposals/2-trainer-local-execution/README.md

+As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
+
+#### Story 2
+As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.


Suggested change

-As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
+As an AI Practitioner, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

kramaranya · 2025-08-27T11:31:52Z

docs/proposals/2-trainer-local-execution/README.md

+As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
+
+### Notes/Constraints/Caveats
+- The local execution mode will initially support Subprocess, Podman, Docker and Apple Container.


Do we plan to initially support Apple Container though? And what does initially mean? cc @andreyvelich

Not initially. @szaher, maybe we can say that we will investigate other runtime engines, such as Apple Container, in the future.

kramaranya · 2025-08-27T11:35:39Z

docs/proposals/2-trainer-local-execution/README.md

+
+The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
+
+- Different training backends will need to implement the same interface from the `TrainingBackend` abstract class so `TrainerClient` can initialize and load the backend.


ExecutionBackend makes sense to me

kramaranya · 2025-08-27T11:38:19Z

docs/proposals/2-trainer-local-execution/README.md

+
+## Design Details
+
+The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.


Suggested change

-The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
+The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers and virtual environment isolation. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
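To make the "volumes and networks" part of that paragraph concrete, here is a small sketch of how a container backend might assemble a `run` invocation for Docker or Podman. The `ContainerJob` dataclass, the `build_run_args` helper, the image name, and the paths are all hypothetical; only the CLI flags (`--rm`, `--name`, `--network`, `--volume`) are real options that both runtimes accept. The sketch only builds the argument list and does not start a container.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerJob:
    """Hypothetical description of one local training container."""

    name: str
    image: str
    command: list[str]
    network: str
    # Maps a host path to the path where it is mounted inside the container.
    volumes: dict[str, str] = field(default_factory=dict)


def build_run_args(runtime: str, job: ContainerJob) -> list[str]:
    """Build the argv for `<runtime> run ...` without executing it."""
    if runtime not in ("docker", "podman"):
        raise ValueError(f"unsupported runtime: {runtime}")
    args = [runtime, "run", "--rm", "--name", job.name, "--network", job.network]
    for host_path, container_path in job.volumes.items():
        args += ["--volume", f"{host_path}:{container_path}"]
    # The image and the command to run inside it always come last.
    return args + [job.image, *job.command]


job = ContainerJob(
    name="trainer-node-0",
    image="example.com/trainer:latest",  # placeholder image reference
    command=["python", "train.py"],
    network="trainer-net",
    volumes={"/tmp/dataset": "/workspace/dataset"},
)
args = build_run_args("podman", job)
```

Keeping the runtime name as a parameter is one way the Podman and Docker backends could share most of their logic, since the two CLIs accept the same flags used here.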

kramaranya · 2025-09-01T22:29:25Z

/milestone v0.1

andreyvelich · 2025-09-07T22:32:03Z

@szaher, can you please address @kramaranya's comments?

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher · 2025-09-08T23:25:50Z

@andreyvelich all comments addressed

szaher · 2025-09-08T23:26:21Z

@andreyvelich @astefanutti @Electronic-Waste @kramaranya I would appreciate your reviews so we can close this one.

…l-exec-proposal

andreyvelich

Thanks @szaher!
/lgtm
/assign @kramaranya

kramaranya · 2025-09-09T09:01:42Z

Thank you @szaher!
/lgtm

Signed-off-by: Saad Zaher <szaher@redhat.com>

kramaranya · 2025-09-09T09:17:56Z

Awesome!
/lgtm

andreyvelich

/approve

google-oss-prow · 2025-09-09T10:10:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

Needs approval from an approver in each of these files:

~~OWNERS~~ [andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* KEP-2: Local Execution Mode Proposal
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Updated proposal
* add apple containers
* fix typo in Subprocess
* add API consistency to the design details
Signed-off-by: Saad Zaher <szaher@redhat.com>
* update proposal to use training backends
Signed-off-by: Saad Zaher <szaher@redhat.com>
* add constraint on resource limitation for local mode
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Move proposals into docs
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Use ExecutionBackends instead of TrainingBackends
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
* update docs and graphs
Signed-off-by: Saad Zaher <szaher@redhat.com>
* update graphs
Signed-off-by: Saad Zaher <szaher@redhat.com>
---------
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>

KEP-2: Local Execution Mode Proposal

8378fbc

Signed-off-by: Saad Zaher <szaher@redhat.com>

google-oss-prow bot requested review from Electronic-Waste, andreyvelich and astefanutti June 21, 2025 17:09

google-oss-prow bot added the size/M label Jun 21, 2025

andreyvelich reviewed Jun 27, 2025

Updated proposal

e59b17f

* add apple containers
* fix typo in Subprocess
* add API consistency to the design details

Signed-off-by: Saad Zaher <szaher@redhat.com>

google-oss-prow bot added size/XL and removed size/M labels Jun 30, 2025

kramaranya reviewed Jul 20, 2025

szaher added 2 commits August 22, 2025 02:23

update proposal to use training backends

b64c955

Signed-off-by: Saad Zaher <szaher@redhat.com>

add constraint on resource limitation for local mode

336af96

Signed-off-by: Saad Zaher <szaher@redhat.com>

Move proposals into docs

9e1ac06

Signed-off-by: Saad Zaher <szaher@redhat.com>

szaher changed the title ~~KEP-2: Local Execution Mode Proposal~~ docs: KEP-2 Local Execution Mode Proposal Aug 22, 2025

szaher changed the title ~~docs: KEP-2 Local Execution Mode Proposal~~ feat: KEP-2 Local Execution Mode Proposal Aug 22, 2025

andreyvelich reviewed Aug 22, 2025

andreyvelich mentioned this pull request Aug 22, 2025

feat: Implement Kubernetes Backend #68

Merged

1 task

Use ExecutionBackends instead of TrainingBackends

c2e51bf

Signed-off-by: Saad Zaher <eng.szaher@gmail.com>

andreyvelich reviewed Aug 26, 2025

google-oss-prow bot assigned astefanutti, Electronic-Waste, kramaranya and andreyvelich Aug 26, 2025

google-oss-prow bot added the lgtm label Aug 26, 2025

kramaranya mentioned this pull request Aug 8, 2025

[Release] Kubeflow SDK 0.1 Release #45

Closed

12 tasks

kramaranya reviewed Aug 27, 2025

google-oss-prow bot added this to the v0.1 milestone Sep 1, 2025

update docs and graphs

ef4a4a7

Signed-off-by: Saad Zaher <szaher@redhat.com>

Merge branch 'local-exec-proposal' of github.com:szaher/sdk into loca…

93816d7

…l-exec-proposal

google-oss-prow bot added size/XXL and removed lgtm size/XL labels Sep 8, 2025

andreyvelich reviewed Sep 8, 2025

google-oss-prow bot added the lgtm label Sep 8, 2025

update graphs

f54795e

Signed-off-by: Saad Zaher <szaher@redhat.com>

google-oss-prow bot removed the lgtm label Sep 9, 2025

google-oss-prow bot added the lgtm label Sep 9, 2025

andreyvelich reviewed Sep 9, 2025

google-oss-prow bot added the approved label Sep 9, 2025

google-oss-prow bot merged commit 1290f5d into kubeflow:main Sep 9, 2025
10 checks passed

andreyvelich mentioned this pull request Oct 15, 2025

feat: Hyperparameter Optimization APIs in Kubeflow SDK #124

Open

3 tasks

