feat: Add ContainerBackend with Docker and Podman #119
base: main
Conversation
Signed-off-by: Brian Gallagher <briangal@gmail.com>
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thank you for this @Fiona-Waters!
As we discussed here: #111 (comment), can we consolidate Podman and Docker under a single container backend?
Given that those backends should have similar APIs, I think it would be better to consolidate them, similar to KFP: https://www.kubeflow.org/docs/components/pipelines/user-guides/core-functions/execute-kfp-pipelines-locally/#runner-dockerrunner
Thanks @andreyvelich, I will look at updating the implementation.
@andreyvelich @astefanutti regarding comments on this PR and #111, this is what I propose: we have 3 backends:
For the Local Container backend we automatically try Docker first, then Podman, and then fall back to Subprocess if neither runtime is available. We use the adapter pattern with a unified container client adapter interface, and the Docker- and Podman-specific calls are implemented in separate adapter classes.
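For illustration, a rough usage sketch of that proposal (assuming TrainerClient accepts a backend_config argument and the config class names used in this PR; import paths and exact signatures may differ):

```python
from kubeflow.trainer import TrainerClient
# Import path below is an assumption for illustration.
from kubeflow.trainer import ContainerBackendConfig, LocalProcessBackendConfig

# Container backend: auto-detects Docker first, then Podman.
client = TrainerClient(backend_config=ContainerBackendConfig())

# Force a specific container runtime instead of auto-detection.
client = TrainerClient(backend_config=ContainerBackendConfig(runtime="podman"))

# Plain subprocess execution, no container runtime required.
client = TrainerClient(backend_config=LocalProcessBackendConfig())
```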
Sure, that looks great @Fiona-Waters!
Why do we need to fall back to subprocess? In the ContainerBackend users can select:
@Fiona-Waters that sounds good to me. I agree the fallback logic may really apply to choosing the default container runtime. Other than that, I'd be inclined to drop the "Local" prefix entirely. Even Kubernetes could run locally with KinD, and I doubt the SDK will ever do remote processes.
Understood. Let me see what I can do. Thank you for the swift reply!
Ok cool. Let me see what I can do. Thank you!
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
@andreyvelich @astefanutti @briangallagher
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
/ok-to-test
)

# Store job in backend
self._jobs[job_name] = _Job(
Would it be possible to avoid relying on that in-memory "registry" and consistently rely on the state from the container runtime itself?
I've updated to store metadata as labels on containers and networks allowing us to query the container runtime for all job information. See this commit - please let me know what you think. Thanks
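For reference, a minimal sketch of that label-based lookup with the Docker SDK (the label key is an assumption for illustration, not necessarily the one used in this PR):

```python
import docker

# Sketch: query job state straight from the container runtime via labels,
# instead of keeping an in-memory registry. Label key is illustrative.
JOB_LABEL = "trainer.kubeflow.org/job-name"
client = docker.from_env()

def get_job_containers(job_name: str):
    # all=True includes stopped containers, so finished jobs remain visible.
    return client.containers.list(all=True, filters={"label": f"{JOB_LABEL}={job_name}"})

def list_jobs() -> list[str]:
    containers = client.containers.list(all=True, filters={"label": JOB_LABEL})
    return sorted({c.labels[JOB_LABEL] for c in containers})
```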
… memory Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
@Fiona-Waters thanks for this awesome work!
That looks good to me overall.
/assign @kubeflow/kubeflow-sdk-team @briangallagher
from kubeflow.trainer.types import types as base_types

LOCAL_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "local_runtimes"
Maybe:

- LOCAL_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "local_runtimes"
+ LOCAL_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "container_runtimes"

otherwise it could be confusing for the local process backend?
Thank you @Fiona-Waters!
I left my initial messages.
print("\n".join(TrainerClient().get_job_logs(name=job_id)))

## Local Development
Can we also include docs about Trainer local execution in the user guides?
https://www.kubeflow.org/docs/components/trainer/user-guides/
You can also add info from @szaher's PR: #95
WIP PR for this kubeflow/website#4221
@@ -0,0 +1,162 @@
# ContainerBackend
I would suggest that we move these docs to the user guides for now: https://www.kubeflow.org/docs/components/trainer/user-guides/ as we discussed here: #95 (comment)
if self.cfg.runtime:
    # User specified a runtime explicitly
    if self.cfg.runtime == "docker":
        adapter = DockerClientAdapter(self.cfg.container_host)
        adapter.ping()
        logger.info("Using Docker as container runtime")
        return adapter
    elif self.cfg.runtime == "podman":
        adapter = PodmanClientAdapter(self.cfg.container_host)
        adapter.ping()
        logger.info("Using Podman as container runtime")
        return adapter
else:
    # Auto-detect: try Docker first, then Podman
    try:
        adapter = DockerClientAdapter(self.cfg.container_host)
        adapter.ping()
        logger.info("Using Docker as container runtime")
        return adapter
    except Exception as docker_error:
        logger.debug(f"Docker initialization failed: {docker_error}")
        try:
            adapter = PodmanClientAdapter(self.cfg.container_host)
            adapter.ping()
            logger.info("Using Podman as container runtime")
            return adapter
        except Exception as podman_error:
            logger.debug(f"Podman initialization failed: {podman_error}")
            raise RuntimeError(
                "Neither Docker nor Podman is available. "
                "Please install Docker or Podman, or use LocalProcessBackendConfig instead."
            ) from podman_error
I think this can be simplified as follows:

runtime_map = {
    "docker": DockerClientAdapter,
    "podman": PodmanClientAdapter,
}

def get_adapter(cfg):
    runtimes_to_try = [cfg.runtime] if cfg.runtime else ["docker", "podman"]
    last_error = None
    for runtime_name in runtimes_to_try:
        if runtime_name not in runtime_map:
            continue
        try:
            adapter = runtime_map[runtime_name](cfg.container_host)
            adapter.ping()
            logger.info(f"Using {runtime_name} as container runtime")
            return adapter
        except Exception as e:
            logger.debug(f"{runtime_name} initialization failed: {e}")
            last_error = e
    raise RuntimeError(
        "Neither Docker nor Podman is available. "
        "Please install Docker or Podman, or use LocalProcessBackendConfig instead."
    ) from last_error

runtime: types.Runtime | None = None,
initializer: types.Initializer | None = None,
trainer: types.CustomTrainer | types.BuiltinTrainer | None = None,
Please don't use | since we still support Python 3.9 for now. Let's be consistent across backends:
runtime: Optional[types.Runtime] = None,
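For example, a Python 3.9-compatible spelling of that signature (a sketch based on the diff context above; the surrounding class and return type are assumptions):

```python
from typing import Optional, Union

from kubeflow.trainer.types import types

class ContainerBackend:
    # Sketch: typing.Optional/Union instead of PEP 604 "X | Y" unions.
    def train(
        self,
        runtime: Optional[types.Runtime] = None,
        initializer: Optional[types.Initializer] = None,
        trainer: Optional[Union[types.CustomTrainer, types.BuiltinTrainer]] = None,
    ):
        ...
```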
@@ -0,0 +1,25 @@
apiVersion: trainer.kubeflow.org/v1alpha1
Instead of installing the runtimes, can we just read the image version from GitHub dynamically?
Let me look into that. For offline support should we fall back to providing this?
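One possible shape for that (a sketch under assumptions: the URL and cache directory follow constants that appear later in this PR, and runtime manifests are fetched as <name>.yaml files):

```python
import urllib.request
from pathlib import Path

GITHUB_RUNTIMES_BASE_URL = (
    "https://raw.githubusercontent.com/kubeflow/trainer/master/manifests/base/runtimes"
)
CACHE_DIR = Path.home() / ".kubeflow_trainer" / "runtime_cache"

def fetch_runtime_manifest(name: str) -> str:
    """Fetch a runtime manifest from GitHub, falling back to a local cache when offline."""
    cache_file = CACHE_DIR / f"{name}.yaml"
    try:
        url = f"{GITHUB_RUNTIMES_BASE_URL}/{name}.yaml"
        with urllib.request.urlopen(url, timeout=10) as resp:
            manifest = resp.read().decode("utf-8")
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(manifest)  # refresh the offline cache
        return manifest
    except OSError:
        if cache_file.exists():
            return cache_file.read_text()  # offline fallback
        raise
```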
image: Optional[str] = Field(default=None)
pull_policy: str = Field(default="IfNotPresent")
auto_remove: bool = Field(default=True)
gpus: Optional[Union[int, bool]] = Field(default=None)
How is this used?
It allows you to override the default container image specified in the ClusterTrainingRuntime.
I think actually you are referring to gpus. It was pulled over from a previous iteration but isn't being used currently. I can update to include GPU support.
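For illustration, one way a gpus value could be wired through with the Docker SDK (a hedged sketch, not what the PR currently does; Podman handling would differ):

```python
import docker
from docker.types import DeviceRequest

def gpu_device_requests(gpus):
    # gpus=True -> request all GPUs, gpus=N -> request N GPUs, gpus=None/0 -> no request.
    if not gpus:
        return None
    count = -1 if gpus is True else int(gpus)
    return [DeviceRequest(count=count, capabilities=[["gpu"]])]

# Example: pass the device requests when starting a training container.
client = docker.from_env()
container = client.containers.run(
    "pytorch/pytorch:latest",
    command=["python", "-c", "import torch; print(torch.cuda.is_available())"],
    device_requests=gpu_device_requests(True),
    detach=True,
)
```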
I'm a bit confused by this, as there can be multiple ClusterTrainingRuntimes and it's defined by TrainJobs.
Apologies, I understand now that it doesn't make sense to allow the image to be updated here, as it is not per job. Will remove this. Thank you.
gpus: Optional[Union[int, bool]] = Field(default=None)
env: Optional[dict[str, str]] = Field(default=None)
container_host: Optional[str] = Field(default=None)
workdir_base: Optional[str] = Field(default=None)
Can we initially use the default dir, and give users an option to configure it if they need one?
I can remove it for now and use the default dir.
pull_policy: str = Field(default="IfNotPresent")
auto_remove: bool = Field(default=True)
gpus: Optional[Union[int, bool]] = Field(default=None)
env: Optional[dict[str, str]] = Field(default=None)
Do we start a new container every time a Job is submitted?
If yes, this might be controlled via the train() API.
Yes, you're right - similar to the image param. Will remove this. Thanks
env: Optional[dict[str, str]] = Field(default=None)
container_host: Optional[str] = Field(default=None)
workdir_base: Optional[str] = Field(default=None)
runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
To make it less confusing with TrainingRuntime, can we name it:

- runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
+ container_runtime: Optional[Literal["docker", "podman"]] = Field(default="docker")
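Putting that rename together with the other fields from the diff, the config could end up roughly like this (an illustrative sketch; whether image, env, and workdir_base stay is decided in the other review threads):

```python
from typing import Literal, Optional, Union

from pydantic import BaseModel, Field

class ContainerBackendConfig(BaseModel):
    """Sketch of the config after the suggested rename (field set is illustrative)."""
    pull_policy: str = Field(default="IfNotPresent")
    auto_remove: bool = Field(default=True)
    gpus: Optional[Union[int, bool]] = Field(default=None)
    container_host: Optional[str] = Field(default=None)
    # Renamed from "runtime" to avoid confusion with TrainingRuntime;
    # None means auto-detect (Docker first, then Podman).
    container_runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
```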
from collections.abc import Iterator

class ContainerClientAdapter(abc.ABC):
Can we call it BaseContainerClientAdapter(), similar to ExecutionBackend in sdk/kubeflow/trainer/backends/base.py (line 23 in 24f0bbd):

class ExecutionBackend(abc.ABC):

I would suggest we move them to a subdirectory:

container/adapters/base.py
container/adapters/docker.py
container/adapters/podman.py

WDYT @Fiona-Waters?
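A minimal sketch of what container/adapters/base.py could hold under that layout (only ping() is visible in this PR's diff context; the other method names are assumptions):

```python
# container/adapters/base.py (sketch)
import abc
from collections.abc import Iterator

class BaseContainerClientAdapter(abc.ABC):
    """Unified interface implemented by the Docker and Podman adapters."""

    @abc.abstractmethod
    def ping(self) -> None:
        """Raise if the container runtime is not reachable."""

    @abc.abstractmethod
    def run_container(self, image: str, command: list[str], labels: dict[str, str]) -> str:
        """Start a container and return its id (hypothetical method)."""

    @abc.abstractmethod
    def get_logs(self, container_id: str, follow: bool = False) -> Iterator[str]:
        """Stream container logs (hypothetical method)."""
```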
Yes good idea, will do.
BTW thank you for your review. Will update the PR tomorrow hopefully.
@andreyvelich I have addressed all of your comments, please review again when you can. I have removed the README.md and will add it along with docs on local execution to the user guides.
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
| """ | ||
| Create per-job working directory on host. | ||
| Working directories are created under ~/.kubeflow_trainer/localcontainer/<job_name> |
Maybe ~/.kubeflow/trainer/containers/... ?
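For example, a tiny sketch of the helper with that layout (the exact path is still under discussion, so the directory name here is only a suggestion):

```python
from pathlib import Path

def create_job_workdir(job_name: str) -> Path:
    # Suggested layout: ~/.kubeflow/trainer/containers/<job_name> (not final).
    workdir = Path.home() / ".kubeflow" / "trainer" / "containers" / job_name
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir
```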
logger = logging.getLogger(__name__)

CONTAINER_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "container_runtimes"
I think I've suggested "container_runtimes" before, but looking at it maybe training_runtimes would be more appropriate?
logger = logging.getLogger(__name__)

CONTAINER_RUNTIMES_DIR = Path(__file__).parents[1] / "config" / "container_runtimes"
CACHE_DIR = Path.home() / ".kubeflow_trainer" / "runtime_cache"
Maybe ~/.kubeflow/trainer/cache?
# GitHub runtimes configuration
GITHUB_RUNTIMES_BASE_URL = (
    "https://raw.githubusercontent.com/kubeflow/trainer/master/manifests/base/runtimes"
Note for later: we should probably find a way to rely on released runtimes.
@@ -0,0 +1,417 @@
# Copyright 2025 The Kubeflow Authors.
training_runtime_loader.py?
gpus: Optional[Union[int, bool]] = Field(default=None)
container_host: Optional[str] = Field(default=None)
container_runtime: Optional[Literal["docker", "podman"]] = Field(default=None)
use_github_runtimes: bool = Field(default=True)
Maybe we can be more structured here and have a training_runtimes argument that could take different options, like pointing to some URLs.
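One rough shape for that idea (purely illustrative; the class and field names are hypothetical):

```python
from typing import Optional

from pydantic import BaseModel, Field

class TrainingRuntimesSource(BaseModel):
    """Hypothetical structured replacement for the use_github_runtimes flag."""
    # Base URL to fetch runtime manifests from (could point at a pinned release).
    url: Optional[str] = Field(
        default="https://raw.githubusercontent.com/kubeflow/trainer/master/manifests/base/runtimes"
    )
    # Local directory of manifests for offline use; takes precedence when set.
    local_dir: Optional[str] = Field(default=None)
```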
| """ | ||
|
|
||
| from kubeflow.trainer.backends.container_runtime_loader import ( | ||
| CONTAINER_RUNTIMES_DIR, |
- CONTAINER_RUNTIMES_DIR,
+ TRAINING_RUNTIMES_DIR,
from kubeflow.trainer.backends.container_runtime_loader import (
    CONTAINER_RUNTIMES_DIR,
    get_container_runtime,
- get_container_runtime,
+ get_training_runtime,
from kubeflow.trainer.backends.container_runtime_loader import (
    CONTAINER_RUNTIMES_DIR,
    get_container_runtime,
    list_container_runtimes,
- list_container_runtimes,
+ list_training_runtimes,
What this PR does / why we need it:
This PR introduces a unified ContainerBackend that automatically detects and uses either Docker or Podman for local training execution. This replaces the previous separate LocalDockerBackend and LocalPodmanBackend implementations with a single, cleaner abstraction. You can see the Docker and Podman implementations in separate commits.
This implementation tries Docker first, then falls back to Podman if Docker is unavailable. This can be overridden via ContainerBackendConfig.runtime to force a specific runtime ("docker" or "podman"). An error is raised if neither runtime is available.
Unit tests for the backend implementation have also been added. Examples for using Docker and Podman will be added to the Trainer repo later.
Manually testing on Mac I had to specify the container_host like so:

Docker via Colima:
container_host=f"unix://{os.path.expanduser('~')}/.colima/default/docker.sock"

Podman Desktop:
container_host=f"unix://{os.path.expanduser('~')}/.local/share/containers/podman/machine/podman.sock"

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #114 and #108
Checklist:
I need to look at adding docs. A README has been included.