
Conversation

@andreyvelich
Member

Fixes: #31, #29

I refactored the SDK to support runtime labels and to give users the ability to override the default runtime images.

/assign @Electronic-Waste @astefanutti @kramaranya @jskswamy @ned2 @eoinfennessy @briangallagher

@google-oss-prow

@andreyvelich: GitHub didn't allow me to assign the following users: jskswamy, ned2, eoinfennessy, briangallagher.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines +170 to +179
__command: tuple[str, ...] = field(init=False, repr=False)

@property
def command(self) -> tuple[str, ...]:
return self.__command

def set_command(self, command: tuple[str, ...]):
self.__command = command
Member Author

I made __command private so that it cannot be accessed or modified by users directly.
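As a minimal sketch of why this works (the command value here is illustrative), Python name-mangles double-underscore attributes, so outside callers can only read the command through the property:

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeTrainer:
    # Name mangling stores this as _RuntimeTrainer__command,
    # keeping it out of the public API.
    __command: tuple[str, ...] = field(init=False, repr=False)

    @property
    def command(self) -> tuple[str, ...]:
        return self.__command

    def set_command(self, command: tuple[str, ...]):
        self.__command = command


trainer = RuntimeTrainer()
trainer.set_command(("torchrun", "train.py"))  # illustrative command
print(trainer.command)  # ('torchrun', 'train.py')
```

Note that assigning to `trainer.command` raises AttributeError, since the property has no setter.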

Comment on lines +129 to +136
EXEC_FUNC_SCRIPT = textwrap.dedent(
"""
read -r -d '' SCRIPT << EOM\n
{func_code}
EOM
printf "%s" \"$SCRIPT\" > \"{func_file}\"
__ENTRYPOINT__ \"{func_file}\""""
)
Member Author

For runtimes with CustomTrainer, I insert the complete command when the Runtime object is created:

# Set the Trainer entrypoint.
if framework == types.TORCH_TUNE:
trainer.set_command(constants.TORCHTUNE_COMMAND)
elif ml_policy.torch:
trainer.set_command(constants.TORCH_COMMAND)
elif ml_policy.mpi:
trainer.set_command(constants.MPI_COMMAND)
else:
trainer.set_command(constants.DEFAULT_COMMAND)

I think this is a much clearer approach, since we always have the final container command in runtime.trainer.__command.
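Putting the template and the entrypoint together, the rendered script for a trivial function would look roughly like this (the function body, file path, and `python` entrypoint are illustrative values, not the SDK's real defaults):

```python
import textwrap

# Same template as in the diff above.
EXEC_FUNC_SCRIPT = textwrap.dedent(
    """
    read -r -d '' SCRIPT << EOM\n
    {func_code}
    EOM
    printf "%s" \"$SCRIPT\" > \"{func_file}\"
    __ENTRYPOINT__ \"{func_file}\""""
)

# Render with illustrative values; the real SDK derives these elsewhere.
rendered = EXEC_FUNC_SCRIPT.format(
    func_code='print("hello")',
    func_file="/tmp/func.py",
).replace("__ENTRYPOINT__", "python")

print(rendered)
```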

# The dict where key is the container image and value its representation.
# Each Trainer representation defines trainer parameters (e.g. type, framework, entrypoint).
# TODO (andreyvelich): We should allow users to override the default image names.
ALL_TRAINERS: Dict[str, RuntimeTrainer] = {
Contributor

Would keeping the existing mapping, but with the framework as key instead of the image, make things tidier / more structured rather than replacing that with individual strings?

Member Author

@andreyvelich andreyvelich Aug 1, 2025

But we don't need it, do we? Theoretically, users should be able to use any arbitrary framework they define in the runtime labels with CustomTrainer, even ones we don't support upstream (e.g. scikit-learn, tf, etc.).
The important mapping is only for BuiltinTrainers, since we need to map the framework name and BuiltinTrainer configs:

BuiltinTrainer.__annotations__["config"].__name__.lower().replace("config", "")

If you want, we can create a mapping for BuiltinTrainer configs, but for now we have only one config (TorchTune).
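If such a mapping were added later, it could look roughly like this (a sketch; `BUILTIN_CONFIGS`, the helper, and the config field are hypothetical — only the name-derivation expression comes from the thread):

```python
from dataclasses import dataclass


@dataclass
class TorchTuneConfig:
    """Placeholder for the real TorchTuneConfig."""
    dtype: str = "bf16"  # illustrative field


def framework_from_config(config_cls: type) -> str:
    # Derive the framework name from the config class name,
    # mirroring the expression quoted above.
    return config_cls.__name__.lower().replace("config", "")


# Hypothetical mapping; today there is only one builtin config.
BUILTIN_CONFIGS: dict[str, type] = {
    framework_from_config(cls): cls for cls in (TorchTuneConfig,)
}

print(list(BUILTIN_CONFIGS))  # ['torchtune']
```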

Contributor

OK that makes sense. It's more an internal implementation detail anyway.

Member

We should verify that:

  1. CustomTrainer does not reference runtimes that correspond to a BuiltinTrainer
  2. BuiltinTrainer only references its corresponding runtimes
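A minimal sketch of that check (the types, field names, and error message are assumptions, not the SDK's actual API):

```python
from dataclasses import dataclass
from enum import Enum


class TrainerType(Enum):
    CUSTOM = "custom"
    BUILTIN = "builtin"


@dataclass
class Runtime:
    name: str
    trainer_type: TrainerType  # assumed to be derived from the runtime labels


def validate_trainer(runtime: Runtime, trainer_type: TrainerType) -> None:
    # Reject a CustomTrainer paired with a builtin runtime, and vice versa.
    if runtime.trainer_type != trainer_type:
        raise ValueError(
            f"Runtime '{runtime.name}' expects a {runtime.trainer_type.value} "
            f"trainer, but got a {trainer_type.value} trainer."
        )


# Matching pair: no exception is raised.
validate_trainer(Runtime("torch-distributed", TrainerType.CUSTOM), TrainerType.CUSTOM)
```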

Member

@Electronic-Waste Electronic-Waste left a comment

@andreyvelich Thanks for creating this. I left my initial review for you.

Comment on lines +57 to +59
# The label key to identify the ML framework that the runtime uses (e.g. torch, deepspeed, torchtune, etc.)
RUNTIME_FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

Member

Shall we also define TRAINER_TYPE_LABEL?

Member Author

No, we don't really need it as we discussed here: kubeflow/trainer#2761 (comment)
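For context, the framework label above would be consumed roughly like this when inspecting a runtime (the helper and the "torch" fallback are assumptions, not the SDK's actual code):

```python
# Label key from the constants file quoted above.
RUNTIME_FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


def get_framework(runtime_labels: dict[str, str]) -> str:
    # Fall back to "torch" when a runtime carries no framework label (assumption).
    return runtime_labels.get(RUNTIME_FRAMEWORK_LABEL, "torch")


print(get_framework({"trainer.kubeflow.org/framework": "torchtune"}))  # torchtune
print(get_framework({}))  # torch
```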

func: Callable
func_args: Optional[Dict] = None
packages_to_install: Optional[List[str]] = None
packages_to_install: Optional[list[str]] = None
Member

Why should we change from List to list?
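For background (not stated in the thread): since Python 3.9, builtin collection types support subscripting directly (PEP 585), so the typing.List alias is redundant:

```python
from typing import List, get_origin

# Pre-3.9 style required the typing alias:
old_style: List[str] = ["torch"]

# Python 3.9+ can subscript the builtin directly (PEP 585):
new_style: list[str] = ["torch"]

# Both annotations resolve to the same runtime origin.
print(get_origin(List[str]) is list, get_origin(list[str]) is list)  # True True
```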


# Change it to list: BUILTIN_CONFIGS, once we support more Builtin Trainer configs.
TORCH_TUNE = (
Member

Do you prefer TORCHTUNE or TORCH_TUNE?

Member Author

Maybe TORCH_TUNE is more accurate since we use camel case here: TorchTuneConfig.

Comment on lines 195 to 211
if isinstance(trainer, types.CustomTrainer):
trainer_crd = utils.get_trainer_crd_from_custom_trainer(
trainer, runtime
runtime, trainer
Member

Can you add validation here to ensure that users reference the right runtime, as we discussed here: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1753710956860929

Comment on lines 201 to 221
elif isinstance(trainer, types.BuiltinTrainer):
trainer_crd = utils.get_trainer_crd_from_builtin_trainer(
trainer, initializer
runtime, trainer, initializer
Member

Same as above

Member Author

Sure, thanks for the reminder!

@astefanutti
Contributor

astefanutti commented Aug 1, 2025

/lgtm

Thanks!

@google-oss-prow google-oss-prow bot added lgtm and removed lgtm labels Aug 1, 2025
@coveralls

coveralls commented Aug 1, 2025

Pull Request Test Coverage Report for Build 16706738665

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 48 of 84 (57.14%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.4%) to 66.038%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
python/kubeflow/trainer/api/trainer_client.py | 6 | 11 | 54.55%
python/kubeflow/trainer/utils/utils.py | 42 | 73 | 57.53%

Files with Coverage Reduction | New Missed Lines | %
python/kubeflow/trainer/utils/utils.py | 1 | 58.72%

Totals Coverage Status
Change from base Build 16677657237: 0.4%
Covered Lines: 280
Relevant Lines: 424

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich force-pushed the runtime-framework-labels branch from d6603e6 to 033357e Compare August 3, 2025 15:46
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

/retest

Contributor

@kramaranya kramaranya left a comment

This looks great, thank you!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Aug 4, 2025
Member Author

@andreyvelich andreyvelich left a comment

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 09a5ea7 into kubeflow:main Aug 4, 2025
8 of 12 checks passed
@andreyvelich andreyvelich deleted the runtime-framework-labels branch August 4, 2025 15:15
@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Aug 4, 2025
