feat: Add ContainerBackend with Docker and Podman #119
base: main
Conversation
Signed-off-by: Brian Gallagher <briangal@gmail.com>
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thank you for this @Fiona-Waters!
As we discussed here: #111 (comment), can we consolidate Podman and Docker under a single container backend?
Given that those backends should have similar APIs, I think it would be better to consolidate them, similar to KFP: https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/#runner-dockerrunner
Thanks @andreyvelich, I will look at updating the implementation.
@andreyvelich @astefanutti regarding comments on this PR and #111, this is what I propose: we have 3 backends:
For the Local Container backend we automatically try Docker first, then Podman, and then fall back to Subprocess if neither runtime is available. We use the adapter pattern with a unified container client adapter interface, and the Docker- and Podman-specific calls are implemented in separate adapter classes.
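For illustration, a rough usage sketch of that proposal (assuming TrainerClient accepts a backend_config argument and the config class names used in this PR; import paths and exact signatures may differ):

```python
from kubeflow.trainer import TrainerClient
# Import path below is an assumption for illustration.
from kubeflow.trainer import ContainerBackendConfig, LocalProcessBackendConfig

# Container backend: auto-detects Docker first, then Podman.
client = TrainerClient(backend_config=ContainerBackendConfig())

# Force a specific container runtime instead of auto-detection.
client = TrainerClient(backend_config=ContainerBackendConfig(runtime="podman"))

# Plain subprocess execution, no container runtime required.
client = TrainerClient(backend_config=LocalProcessBackendConfig())
```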
Sure, that looks great @Fiona-Waters!
Why do we need to fall back to subprocess? In the ContainerBackend users can select:
@Fiona-Waters that sounds good to me. I agree the fallback logic may really apply to choosing the default container runtime. Other than that, I'd be inclined to drop the "Local" prefix entirely. Even Kubernetes could run locally with KinD, and I doubt the SDK will ever do remote processes.
Understood. Let me see what I can do. Thank you for the swift reply!
Ok cool. Let me see what I can do. Thank you!
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
@andreyvelich @astefanutti @briangallagher
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
/ok-to-test
)

# Store job in backend
self._jobs[job_name] = _Job(
Would it be possible to avoid relying on that in-memory "registry" and consistently rely on the state from the container runtime itself?
I've updated to store metadata as labels on containers and networks allowing us to query the container runtime for all job information. See this commit - please let me know what you think. Thanks
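For reference, a minimal sketch of that label-based lookup with the Docker SDK (the label key is an assumption for illustration, not necessarily the one used in this PR):

```python
import docker

# Sketch: query job state straight from the container runtime via labels,
# instead of keeping an in-memory registry. Label key is illustrative.
JOB_LABEL = "trainer.kubeflow.org/job-name"
client = docker.from_env()

def get_job_containers(job_name: str):
    # all=True includes stopped containers, so finished jobs remain visible.
    return client.containers.list(all=True, filters={"label": f"{JOB_LABEL}={job_name}"})

def list_jobs() -> list[str]:
    containers = client.containers.list(all=True, filters={"label": JOB_LABEL})
    return sorted({c.labels[JOB_LABEL] for c in containers})
```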
… memory Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
@Fiona-Waters thanks for this awesome work!
That looks good to me overall.
/assign @kubeflow/kubeflow-sdk-team @briangallagher
from kubeflow.trainer.types import types as base_types

LOCAL_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "local_runtimes"
Maybe:

- LOCAL_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "local_runtimes"
+ LOCAL_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "container_runtimes"

otherwise it could be confusing for the local process backend?
Thank you @Fiona-Waters!
I left my initial messages.
print("\n".join(TrainerClient().get_job_logs(name=job_id)))

## Local Development
Can we also include docs about Trainer local execution in the user guides?
https://www.kubeflow.org/docs/components/trainer/user-guides/
You can also add info from @szaher's PR: #95
WIP PR for this kubeflow/website#4221
@@ -0,0 +1,162 @@
# ContainerBackend
I would suggest that we move these docs to the user guides for now: https://www.kubeflow.org/docs/components/trainer/user-guides/ as we discussed here: #95 (comment)
if self.cfg.runtime:
    # User specified a runtime explicitly
    if self.cfg.runtime == "docker":
        adapter = DockerClientAdapter(self.cfg.container_host)
        adapter.ping()
        logger.info("Using Docker as container runtime")
        return adapter
    elif self.cfg.runtime == "podman":
        adapter = PodmanClientAdapter(self.cfg.container_host)
        adapter.ping()
        logger.info("Using Podman as container runtime")
        return adapter
else:
    # Auto-detect: try Docker first, then Podman
    try:
        adapter = DockerClientAdapter(self.cfg.container_host)
        adapter.ping()
        logger.info("Using Docker as container runtime")
        return adapter
    except Exception as docker_error:
        logger.debug(f"Docker initialization failed: {docker_error}")
        try:
            adapter = PodmanClientAdapter(self.cfg.container_host)
            adapter.ping()
            logger.info("Using Podman as container runtime")
            return adapter
        except Exception as podman_error:
            logger.debug(f"Podman initialization failed: {podman_error}")
            raise RuntimeError(
                "Neither Docker nor Podman is available. "
                "Please install Docker or Podman, or use LocalProcessBackendConfig instead."
            ) from podman_error
I think this can be simplified as follows:

runtime_map = {
    "docker": DockerClientAdapter,
    "podman": PodmanClientAdapter,
}

def get_adapter(cfg):
    runtimes_to_try = [cfg.runtime] if cfg.runtime else ["docker", "podman"]
    last_error = None
    for runtime_name in runtimes_to_try:
        if runtime_name not in runtime_map:
            continue
        try:
            adapter = runtime_map[runtime_name](cfg.container_host)
            adapter.ping()
            logger.info(f"Using {runtime_name} as container runtime")
            return adapter
        except Exception as e:
            logger.debug(f"{runtime_name} initialization failed: {e}")
            last_error = e
    raise RuntimeError(
        "Neither Docker nor Podman is available. "
        "Please install Docker or Podman, or use LocalProcessBackendConfig instead."
    ) from last_error

runtime: types.Runtime | None = None,
initializer: types.Initializer | None = None,
trainer: types.CustomTrainer | types.BuiltinTrainer | None = None,
Please don't use | since we still support Python 3.9 for now. Let's be consistent across backends:
runtime: Optional[types.Runtime] = None,
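For example, a Python 3.9-compatible spelling of that signature (a sketch based on the diff context above; the surrounding class and return type are assumptions):

```python
from typing import Optional, Union

from kubeflow.trainer.types import types

class ContainerBackend:
    # Sketch: typing.Optional/Union instead of PEP 604 "X | Y" unions.
    def train(
        self,
        runtime: Optional[types.Runtime] = None,
        initializer: Optional[types.Initializer] = None,
        trainer: Optional[Union[types.CustomTrainer, types.BuiltinTrainer]] = None,
    ):
        ...
```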
@@ -0,0 +1,25 @@
apiVersion: trainer.kubeflow.org/v1alpha1
Instead of installing the runtimes, can we just read the image version from GitHub dynamically?
Let me look into that. For offline support should we fall back to providing this?
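One possible shape for that (a sketch under assumptions: the URL and cache directory follow constants that appear later in this PR, and runtime manifests are fetched as <name>.yaml files):

```python
import urllib.request
from pathlib import Path

GITHUB_RUNTIMES_BASE_URL = (
    "https://raw.githubusercontent.com/kubeflow/trainer/master/manifests/base/runtimes"
)
CACHE_DIR = Path.home() / ".kubeflow_trainer" / "runtime_cache"

def fetch_runtime_manifest(name: str) -> str:
    """Fetch a runtime manifest from GitHub, falling back to a local cache when offline."""
    cache_file = CACHE_DIR / f"{name}.yaml"
    try:
        url = f"{GITHUB_RUNTIMES_BASE_URL}/{name}.yaml"
        with urllib.request.urlopen(url, timeout=10) as resp:
            manifest = resp.read().decode("utf-8")
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(manifest)  # refresh the offline cache
        return manifest
    except OSError:
        if cache_file.exists():
            return cache_file.read_text()  # offline fallback
        raise
```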
image: Optional[str] = Field(default=None)
pull_policy: str = Field(default="IfNotPresent")
auto_remove: bool = Field(default=True)
gpus: Optional[Union[int, bool]] = Field(default=None)
How is this used?
It allows you to override the default container image specified in the ClusterTrainingRuntime.
I think actually you are referring to gpus. It was pulled over from a previous iteration but isn't being used currently. I can update to include GPU support.
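For illustration, one way a gpus value could be wired through with the Docker SDK (a hedged sketch, not what the PR currently does; Podman handling would differ):

```python
import docker
from docker.types import DeviceRequest

def gpu_device_requests(gpus):
    # gpus=True -> request all GPUs, gpus=N -> request N GPUs, gpus=None/0 -> no request.
    if not gpus:
        return None
    count = -1 if gpus is True else int(gpus)
    return [DeviceRequest(count=count, capabilities=[["gpu"]])]

# Example: pass the device requests when starting a training container.
client = docker.from_env()
container = client.containers.run(
    "pytorch/pytorch:latest",
    command=["python", "-c", "import torch; print(torch.cuda.is_available())"],
    device_requests=gpu_device_requests(True),
    detach=True,
)
```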
I'm a bit confused by this, as there can be multiple ClusterTrainingRuntimes and it's defined by TrainJobs.
Apologies, I understand now that it doesn't make sense to allow the image to be updated here, as it is not per job. Will remove this. Thank you.
gpus: Optional[Union[int, bool]] = Field(default=None)
env: Optional[dict[str, str]] = Field(default=None)
container_host: Optional[str] = Field(default=None)
workdir_base: Optional[str] = Field(default=None)
Can we initially use the default dir, and give users an option to configure it if they need one?
I can remove it for now and use the default dir.
pull_policy: str = Field(default="IfNotPresent")
auto_remove: bool = Field(default=True)
gpus: Optional[Union[int, bool]] = Field(default=None)
env: Optional[dict[str, str]] = Field(default=None)
Do we start a new container every time a Job is submitted?
If yes, this might be controlled via the train() API.
Yes, you're right - similar to the image param. Will remove this. Thanks
env: Optional[dict[str, str]] = Field(default=None)
container_host: Optional[str] = Field(default=None)
workdir_base: Optional[str] = Field(default=None)
runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
To make it less confusing with TrainingRuntime, can we name it:

- runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
+ container_runtime: Optional[Literal["docker", "podman"]] = Field(default="docker")
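Putting that rename together with the other fields from the diff, the config could end up roughly like this (an illustrative sketch; whether image, env, and workdir_base stay is decided in the other review threads):

```python
from typing import Literal, Optional, Union

from pydantic import BaseModel, Field

class ContainerBackendConfig(BaseModel):
    """Sketch of the config after the suggested rename (field set is illustrative)."""
    pull_policy: str = Field(default="IfNotPresent")
    auto_remove: bool = Field(default=True)
    gpus: Optional[Union[int, bool]] = Field(default=None)
    container_host: Optional[str] = Field(default=None)
    # Renamed from "runtime" to avoid confusion with TrainingRuntime;
    # None means auto-detect (Docker first, then Podman).
    container_runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
```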
from collections.abc import Iterator

class ContainerClientAdapter(abc.ABC):
Can we call it BaseContainerClientAdapter(), similar to ExecutionBackend in sdk/kubeflow/trainer/backends/base.py (line 23 in 24f0bbd):

class ExecutionBackend(abc.ABC):

I would suggest we move them to a subdirectory:

container/adapters/base.py
container/adapters/docker.py
container/adapters/podman.py

WDYT @Fiona-Waters?
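A minimal sketch of what container/adapters/base.py could hold under that layout (only ping() is visible in this PR's diff context; the other method names are assumptions):

```python
# container/adapters/base.py (sketch)
import abc
from collections.abc import Iterator

class BaseContainerClientAdapter(abc.ABC):
    """Unified interface implemented by the Docker and Podman adapters."""

    @abc.abstractmethod
    def ping(self) -> None:
        """Raise if the container runtime is not reachable."""

    @abc.abstractmethod
    def run_container(self, image: str, command: list[str], labels: dict[str, str]) -> str:
        """Start a container and return its id (hypothetical method)."""

    @abc.abstractmethod
    def get_logs(self, container_id: str, follow: bool = False) -> Iterator[str]:
        """Stream container logs (hypothetical method)."""
```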
Yes good idea, will do.
BTW thank you for your review. Will update the PR tomorrow hopefully.
@andreyvelich I have addressed all of your comments, please review again when you can. I have removed the README.md and will add it along with docs on local execution to the user guides.
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
| """ | ||
| Create per-job working directory on host. | ||
| Working directories are created under ~/.kubeflow_trainer/localcontainer/<job_name> |
Maybe ~/.kubeflow/trainer/containers/... ?
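For example, a tiny sketch of the helper with that layout (the exact path is still under discussion, so the directory name here is only a suggestion):

```python
from pathlib import Path

def create_job_workdir(job_name: str) -> Path:
    # Suggested layout: ~/.kubeflow/trainer/containers/<job_name> (not final).
    workdir = Path.home() / ".kubeflow" / "trainer" / "containers" / job_name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir
```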
logger = logging.getLogger(__name__)

CONTAINER_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "container_runtimes"
I think I've suggested "container_runtimes" before, but looking at it maybe training_runtimes would be more appropriate?
logger = logging.getLogger(__name__)

CONTAINER_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "container_runtimes"
CACHE_DIR = Path.home() / ".kubeflow_trainer" / "runtime_cache"
Maybe ~/.kubeflow/trainer/cache?
# GitHub runtimes configuration
GITHUB_RUNTIMES_BASE_URL = (
    "https://raw.githubusercontent.com/kubeflow/trainer/master/manifests/base/runtimes"
Note for later: we should probably find a way to rely on released runtimes.
@@ -0,0 +1,417 @@
# Copyright 2025 The Kubeflow Authors.
training_runtime_loader.py?
gpus: Optional[Union[int, bool]] = Field(default=None)
container_host: Optional[str] = Field(default=None)
container_runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
use_github_runtimes: bool = Field(default=True)
Maybe we can be more structured here and have a training_runtimes argument that could take different options, like pointing to some URLs.
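One rough shape for that idea (purely illustrative; the class and field names are hypothetical):

```python
from typing import Optional

from pydantic import BaseModel, Field

class TrainingRuntimesSource(BaseModel):
    """Hypothetical structured replacement for the use_github_runtimes flag."""
    # Base URL to fetch runtime manifests from (could point at a pinned release).
    url: Optional[str] = Field(
        default="https://raw.githubusercontent.com/kubeflow/trainer/master/manifests/base/runtimes"
    )
    # Local directory of manifests for offline use; takes precedence when set.
    local_dir: Optional[str] = Field(default=None)
```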
| """ | ||
|
|
||
| from kubeflow.trainer.backends.container_runtime_loader import ( | ||
| CONTAINER_RUNTIMES_DIR, |
- CONTAINER_RUNTIMES_DIR,
+ TRAINING_RUNTIMES_DIR,
from kubeflow.trainer.backends.container_runtime_loader import (
    CONTAINER_RUNTIMES_DIR,
    get_container_runtime,
- get_container_runtime,
+ get_training_runtime,
from kubeflow.trainer.backends.container_runtime_loader import (
    CONTAINER_RUNTIMES_DIR,
    get_container_runtime,
    list_container_runtimes,
- list_container_runtimes,
+ list_training_runtimes,
What this PR does / why we need it:
This PR introduces a unified ContainerBackend that automatically detects and uses either Docker or Podman for local training execution. This replaces the previous separate LocalDockerBackend and LocalPodmanBackend implementations with a single, cleaner abstraction. You can see the Docker and Podman implementations in separate commits.
This implementation tries Docker first, then falls back to Podman if Docker is unavailable. This can be overridden via ContainerBackendConfig.runtime to force a specific runtime ("docker" or "podman"). An error is raised if neither runtime is available.
Unit tests for the backend implementation have also been added. Examples for using Docker and Podman will be added to the Trainer repo later.
Manually testing on Mac I had to specify the container_host like so:

Docker via Colima:
container_host=f"unix://{os.path.expanduser('~')}/.colima/default/docker.sock"

Podman Desktop:
container_host=f"unix://{os.path.expanduser('~')}/.local/share/containers/podman/machine/podman.sock"

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #114 and #108
Checklist:
I need to look at adding docs. A README has been included.