Conversation

@szaher
Member

@szaher szaher commented Jun 21, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Saad Zaher <szaher@redhat.com>
Member

@andreyvelich andreyvelich left a comment

Sorry for the late review @szaher, I left a few comments.
/assign @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @akshaychitneni @shravan-achar

    * add apple containers
    * fix typo in Subprocess
    * add API consistency to the design details

Signed-off-by: Saad Zaher <szaher@redhat.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/M labels Jun 30, 2025
@szaher
Member Author

szaher commented Jun 30, 2025

Thanks for your review, @andreyvelich. I updated my branch, fixed some of the issues, and answered your questions.

Contributor

@kramaranya kramaranya left a comment

Thank you @szaher!

szaher added 2 commits August 22, 2025 02:23
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
@kramaranya
Contributor

@szaher could you please move this to docs/proposals?

Signed-off-by: Saad Zaher <szaher@redhat.com>
@szaher szaher changed the title KEP-2: Local Execution Mode Proposal docs: KEP-2 Local Execution Mode Proposal Aug 22, 2025
@szaher szaher changed the title docs: KEP-2 Local Execution Mode Proposal feat: KEP-2 Local Execution Mode Proposal Aug 22, 2025

The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

- Different training backends will need to implement the same interface from the `TrainingBackend` abstract class so `TrainerClient` can initialize and load the backend.
Member

Member Author

This is very specific to the trainer, not generic. Runner or RunnerBackend makes sense in the pipelines case, but for the trainer I believe it makes more sense to use TrainingBackend.

Member

@andreyvelich andreyvelich Aug 22, 2025

Once we migrate other types of jobs, such as OptimizeJob for hyperparameter tuning, we might want to introduce various backends for them as well, right? I am trying to find a name that works for all types of jobs, so users can quickly understand that they will use a local backend for their training jobs, optimization jobs, and ML pipelines.

Alternatively, we can call it ExecutionBackend.

Member Author

@szaher szaher Aug 26, 2025

I have renamed it to ExecutionBackend for now. I believe at some point we might want to move the common local execution logic into the root package, i.e.:

from kubeflow.local import ExecutionBackend

Contributor

ExecutionBackend makes sense to me
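
To make the naming discussion above concrete, here is a minimal sketch of the pattern the proposal describes: an abstract `ExecutionBackend` that every backend implements, and a `TrainerClient` that delegates to whichever backend it is handed. Only the names `ExecutionBackend` and `TrainerClient` come from the thread. The method names (`train`, `get_job_logs`), the toy `InMemoryBackend`, and the constructor signature are assumptions made for illustration and may not match the SDK.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class ExecutionBackend(ABC):
    """Hypothetical common interface; the method names are assumptions."""

    @abstractmethod
    def train(self, command: list[str]) -> str:
        """Start a job and return its name."""

    @abstractmethod
    def get_job_logs(self, name: str) -> str:
        """Return the logs collected for a job."""


class InMemoryBackend(ExecutionBackend):
    """Toy backend that records commands instead of running them."""

    def __init__(self) -> None:
        self._logs: dict[str, str] = {}

    def train(self, command: list[str]) -> str:
        name = f"job-{len(self._logs)}"
        self._logs[name] = "ran: " + " ".join(command)
        return name

    def get_job_logs(self, name: str) -> str:
        return self._logs[name]


class TrainerClient:
    """Delegates every call to the backend it was constructed with."""

    def __init__(self, backend: ExecutionBackend) -> None:
        self.backend = backend

    def train(self, command: list[str]) -> str:
        return self.backend.train(command)

    def get_job_logs(self, name: str) -> str:
        return self.backend.get_job_logs(name)
```

Because the client depends only on the abstract class, a backend for another job type (the OptimizeJob case raised above) could reuse the same interface without changes to the client.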

Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
Member

@andreyvelich andreyvelich left a comment

Thank you for the updates @szaher!
/lgtm
/assign @kramaranya @astefanutti @Electronic-Waste

Contributor

@kramaranya kramaranya left a comment

Thank you, @szaher!
I left a few nits


## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
Contributor

Suggested change
This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing AI Practitioners to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.

Member Author

done

## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
The feature will enable ML engineers to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.
Contributor

Suggested change
The feature will enable ML engineers to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.
The feature will enable AI Practitioners to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.

Member Author

done


Currently, Kubeflow’s Trainer SDK requires jobs to be executed on a Kubernetes cluster.
This setup can incur significant costs and time delays, especially for model experiments that are in the early stages.
ML engineers often want to experiment locally before scaling their models to a full cloud-based infrastructure.
Contributor

Suggested change
ML engineers often want to experiment locally before scaling their models to a full cloud-based infrastructure.
AI Practitioners often want to experiment locally before scaling their models to a full cloud-based infrastructure.

Member Author

done

### Goals
- Allow users to run training jobs on their local machines using container runtimes or subprocess.
- Rework current Kubeflow Trainer SDK to implement Execution Backends with Kubernetes Backend as default.
- Implement Local Execution/Training Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
Contributor

I think we could just keep Execution Backends for now

Suggested change
- Implement Local Execution/Training Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
- Implement Local Execution Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.

Member Author

done
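
As an illustration of the goal of keeping Kubernetes as the default backend, the selection step could be sketched as follows. Every name here (`KubernetesBackendConfig`, `LocalProcessBackendConfig`, `select_backend`, and their fields) is hypothetical and chosen only for this example; the proposal text quoted in this thread does not define them.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class KubernetesBackendConfig:
    """Hypothetical settings for the cluster backend."""
    namespace: str = "default"


@dataclass
class LocalProcessBackendConfig:
    """Hypothetical settings for the subprocess backend."""
    venv_dir: str = ".venv"


BackendConfig = Union[KubernetesBackendConfig, LocalProcessBackendConfig]


def select_backend(config: Optional[BackendConfig] = None) -> str:
    """Return the name of the backend a client would load for this config."""
    if config is None:
        # No config given: fall back to Kubernetes, mirroring the stated goal.
        config = KubernetesBackendConfig()
    if isinstance(config, KubernetesBackendConfig):
        return "kubernetes"
    if isinstance(config, LocalProcessBackendConfig):
        return "local-process"
    raise ValueError(f"unsupported backend config: {type(config).__name__}")
```

Keying the choice on the config type means existing callers that pass nothing keep their current Kubernetes behavior, while local runs are opt-in.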

Contributor

There's a typo in Start element

Member Author

fixed

### User Stories (Optional)

#### Story 1
As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
Contributor

Suggested change
As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
As an AI Practitioner, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.

Member Author

done

As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.

#### Story 2
As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
Contributor

Suggested change
As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
As an AI Practitioner, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

Member Author

done

As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

### Notes/Constraints/Caveats
- The local execution mode will initially support Subprocess, Podman, Docker and Apple Container.
Contributor

Do we plan to initially support Apple Container though? And what does initially mean? cc @andreyvelich

Member

Not initially. @szaher, maybe we can say that we will investigate other runtime engines, such as Apple Container, in the future.

Member Author

done
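
For context on the runtimes listed in this thread, one way a client might pick among Podman, Docker, and a plain subprocess is to probe `PATH` with the standard library's `shutil.which`. This is only a sketch: the function name `detect_runtime`, the preference order, and the fallback to `"subprocess"` are assumptions, not behavior stated in the proposal.

```python
import shutil
from typing import Callable, Optional

# Assumed preference order; the proposal does not specify one.
CONTAINER_RUNTIMES = ("podman", "docker")


def detect_runtime(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the first container CLI found on PATH, else 'subprocess'."""
    for name in CONTAINER_RUNTIMES:
        if which(name):
            return name
    # No container runtime available: run the job as a local process.
    return "subprocess"
```

The `which` parameter exists so the lookup can be replaced in tests; calling `detect_runtime()` with no arguments inspects the real `PATH`.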


The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

- Different training backends will need to implement the same interface from the `TrainingBackend` abstract class so `TrainerClient` can initialize and load the backend.
Contributor

ExecutionBackend makes sense to me


## Design Details

The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
Contributor

Suggested change
The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers and virtual environment isolation. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

Member Author

done
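
The process-based path mentioned in this thread can be sketched with the standard library alone. The helper below writes a training script to a temporary directory and runs it in a child interpreter, capturing its output. The function name `run_training_script` is hypothetical, and this sketch reuses the current interpreter rather than building a virtual environment; it is not the proposal's `LocalProcessBackend`.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path


def run_training_script(source: str) -> subprocess.CompletedProcess:
    """Run Python source in a separate process and capture its output."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "train.py"
        script.write_text(textwrap.dedent(source))
        # check=False: a failed job is reported through returncode,
        # not raised as an exception.
        return subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            capture_output=True,
            text=True,
            check=False,
        )


if __name__ == "__main__":
    result = run_training_script("print('epoch 1 done')")
    print(result.returncode, result.stdout.strip())
```

To add the virtual environment isolation suggested in the review, the interpreter path could point at an environment created beforehand with the standard `venv` module, at the cost of a slower first run.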

@kramaranya
Contributor

/milestone v0.1

@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Sep 1, 2025
@andreyvelich
Member

@szaher, can you please address @kramaranya's comments?

Signed-off-by: Saad Zaher <szaher@redhat.com>
@szaher
Member Author

szaher commented Sep 8, 2025

@andreyvelich all comments addressed

@szaher
Member Author

szaher commented Sep 8, 2025

@andreyvelich @astefanutti @Electronic-Waste @kramaranya I'd appreciate your reviews so we can close this one.

Member

@andreyvelich andreyvelich left a comment

Thanks @szaher!
/lgtm
/assign @kramaranya

@google-oss-prow google-oss-prow bot added the lgtm label Sep 8, 2025
@kramaranya
Contributor

Thank you @szaher!
/lgtm

Signed-off-by: Saad Zaher <szaher@redhat.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Sep 9, 2025
@kramaranya
Contributor

Awesome!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Sep 9, 2025
Member

@andreyvelich andreyvelich left a comment

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich


@google-oss-prow google-oss-prow bot merged commit 1290f5d into kubeflow:main Sep 9, 2025
10 checks passed
accorvin pushed a commit to opendatahub-io/kubeflow-sdk that referenced this pull request Oct 8, 2025
* KEP-2: Local Execution Mode Proposal

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Updated proposal

    * add apple containers
    * fix typo in Subprocess
    * add API consistency to the design details

Signed-off-by: Saad Zaher <szaher@redhat.com>

* update proposal to use training backends

Signed-off-by: Saad Zaher <szaher@redhat.com>

* add constraint on resource limitation for local mode

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Move proposals into docs

Signed-off-by: Saad Zaher <szaher@redhat.com>

* Use ExecutionBackends instead of TrainingBackends

Signed-off-by: Saad Zaher <eng.szaher@gmail.com>

* update docs and graphs

Signed-off-by: Saad Zaher <szaher@redhat.com>

* update graphs

Signed-off-by: Saad Zaher <szaher@redhat.com>

---------

Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
5 participants