feat: KEP-2 Local Execution Mode Proposal #34
Conversation
Signed-off-by: Saad Zaher <szaher@redhat.com>
Sorry for the late review @szaher, I left a few comments.
/assign @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @akshaychitneni @shravan-achar
* add apple containers
* fix typo in Subprocess
* add API consistency to the design details
Signed-off-by: Saad Zaher <szaher@redhat.com>
Thanks for your review @andreyvelich. I updated my branch, fixed some of the issues, and answered your questions.
Thank you @szaher!
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <szaher@redhat.com>
@szaher could you please move this to docs/proposals?
Signed-off-by: Saad Zaher <szaher@redhat.com>
The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

- Different training backends will need to implement the same interface from the `TrainingBackend` abstract class so `TrainerClient` can initialize and load the backend.
@szaher @astefanutti @kramaranya Shall we call it RunnerBackend or just Runner abstract class to be consistent with KFP: https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/#runner-types ?
This is very specific to the trainer, not generic. Runner or RunnerBackend makes sense in the pipelines case, but for the trainer I believe it makes more sense to use TrainingBackend.
Once we migrate other types of Jobs, like OptimizeJob for hyperparameter tuning, we might want to introduce various backends for them as well, right? I am trying to find a name that works for all types of Jobs, so users can quickly understand that they will use the local backend for their training jobs, optimization jobs, and ML pipelines.
Alternatively, we can call it ExecutionBackend.
I have renamed it to ExecutionBackend for now. I believe at some point we might want to move the common local execution to the root package, i.e. `from kubeflow.local import ExecutionBackend`.
ExecutionBackend makes sense to me
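For illustration, a minimal sketch of what such an `ExecutionBackend` abstract class could look like follows; the method names and signatures are assumptions made for this sketch, not the final SDK API.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ExecutionBackend(ABC):
    """Common interface that every backend (Kubernetes, local process, Podman,
    Docker) would implement so TrainerClient can load any of them
    interchangeably. All method names below are illustrative only."""

    @abstractmethod
    def train(self, trainer, initializer: Optional[dict] = None) -> str:
        """Start a training job and return its identifier."""

    @abstractmethod
    def get_job(self, name: str) -> dict:
        """Return the current status of a previously started job."""

    @abstractmethod
    def get_job_logs(self, name: str, follow: bool = False):
        """Return (or stream) the logs produced by a job."""

    @abstractmethod
    def delete_job(self, name: str) -> None:
        """Remove the job and any resources (containers, volumes, networks)
        created for it."""
```

With this shape, `TrainerClient` can delegate to whichever backend instance it is configured with, keeping the user-facing API identical for Kubernetes and local runs.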
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
Thank you for the updates @szaher!
/lgtm
/assign @kramaranya @astefanutti @Electronic-Waste
Thank you, @szaher!
I left a few nits
## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
Suggested change:
Before: This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
After: This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing AI Practitioners to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
done
## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a kubernetes based infrastructure.
The feature will enable ML engineers to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.
Suggested change:
Before: The feature will enable ML engineers to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.
After: The feature will enable AI Practitioners to use Subprocess, Podman, Docker or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.
done
Currently, Kubeflow’s Trainer SDK requires jobs to be executed on a Kubernetes cluster.
This setup can incur significant costs and time delays, especially for model experiments that are in the early stages.
ML engineers often want to experiment locally before scaling their models to a full cloud-based infrastructure.
Suggested change:
Before: ML engineers often want to experiment locally before scaling their models to a full cloud-based infrastructure.
After: AI Practitioners often want to experiment locally before scaling their models to a full cloud-based infrastructure.
done
### Goals
- Allow users to run training jobs on their local machines using container runtimes or subprocess.
- Rework current Kubeflow Trainer SDK to implement Execution Backends with Kubernetes Backend as default.
- Implement Local Execution/Training Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
I think we could just keep Execution Backends for now
Suggested change:
Before: - Implement Local Execution/Training Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
After: - Implement Local Execution Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
done
There's a typo in Start element
fixed
### User Stories (Optional)

#### Story 1
As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
Suggested change:
Before: As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
After: As an AI Practitioner, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
done
As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.

#### Story 2
As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
Suggested change:
Before: As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
After: As an AI Practitioner, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
done
As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

### Notes/Constraints/Caveats
- The local execution mode will initially support Subprocess, Podman, Docker and Apple Container.
Do we plan to initially support Apple Container though? And what does initially mean? cc @andreyvelich
Not initially. @szaher, maybe we can say that we will investigate other runtime engines such as Container in the future.
done
The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

- Different training backends will need to implement the same interface from the `TrainingBackend` abstract class so `TrainerClient` can initialize and load the backend.
ExecutionBackend makes sense to me
## Design Details

The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
Suggested change:
Before: The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
After: The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers and virtual environment isolation. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.
done
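As a rough usage sketch of the design above (the module paths, the config class name, and the constructor arguments here are assumptions for illustration, not the final API):

```python
# Hypothetical usage sketch: the backend is chosen when the client is built,
# while the training API stays the same for cluster and local runs.
from kubeflow.trainer import TrainerClient                 # assumed import path
from kubeflow.trainer.backends import DockerBackendConfig  # hypothetical config class


def train_fn():
    # Placeholder training function for the sketch.
    print("running one training step")


# Default behavior today: jobs are submitted to a Kubernetes cluster.
cluster_client = TrainerClient()

# Local execution mode: same client API, backed by a container runtime that
# creates isolated volumes/networks for the job and its initializers.
local_client = TrainerClient(backend_config=DockerBackendConfig())

job_name = local_client.train(train_fn)  # argument shape is illustrative
print(local_client.get_job_logs(job_name))
```

Keeping backend selection at client construction time would mean an existing notebook only changes one line when switching between local experimentation and a cluster run.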
/milestone v0.1
@szaher Can you address @kramaranya's comments please?
Signed-off-by: Saad Zaher <szaher@redhat.com>
@andreyvelich all comments addressed
@andreyvelich @astefanutti @Electronic-Waste @kramaranya would appreciate reviews to close this one.
Thanks @szaher!
/lgtm
/assign @kramaranya
Thank you @szaher!
Signed-off-by: Saad Zaher <szaher@redhat.com>
Awesome!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andreyvelich. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
* KEP-2: Local Execution Mode Proposal
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Updated proposal
* add apple containers
* fix typo in Subprocess
* add API consistency to the design details
Signed-off-by: Saad Zaher <szaher@redhat.com>
* update proposal to use training backends
Signed-off-by: Saad Zaher <szaher@redhat.com>
* add constraint on resource limitation for local mode
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Move proposals into docs
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Use ExecutionBackends instead of TrainingBackends
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
* update docs and graphs
Signed-off-by: Saad Zaher <szaher@redhat.com>
* update graphs
Signed-off-by: Saad Zaher <szaher@redhat.com>
---------
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Saad Zaher <eng.szaher@gmail.com>
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: