
Conversation

@andreyvelich
Member

Fixes: #31, #29

I refactored the SDK to support runtime labels and to give users the ability to override the default runtime images.

/assign @Electronic-Waste @astefanutti @kramaranya @jskswamy @ned2 @eoinfennessy @briangallagher

@google-oss-prow

@andreyvelich: GitHub didn't allow me to assign the following users: jskswamy, ned2, eoinfennessy, briangallagher.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines +170 to +179
__command: tuple[str, ...] = field(init=False, repr=False)

@property
def command(self) -> tuple[str, ...]:
return self.__command

def set_command(self, command: tuple[str, ...]):
self.__command = command
Member Author

I made __command private so that it cannot be accessed or modified by users directly.
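As a minimal sketch of why this works (the command value here is illustrative), Python name-mangles double-underscore attributes, so outside callers can only read the command through the property:

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeTrainer:
    # Name mangling stores this as _RuntimeTrainer__command,
    # keeping it out of the public API.
    __command: tuple[str, ...] = field(init=False, repr=False)

    @property
    def command(self) -> tuple[str, ...]:
        return self.__command

    def set_command(self, command: tuple[str, ...]):
        self.__command = command


trainer = RuntimeTrainer()
trainer.set_command(("torchrun", "train.py"))  # illustrative command
print(trainer.command)  # ('torchrun', 'train.py')
```

Note that assigning to `trainer.command` raises AttributeError, since the property has no setter.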

Comment on lines +129 to +136
EXEC_FUNC_SCRIPT = textwrap.dedent(
"""
read -r -d '' SCRIPT << EOM\n
{func_code}
EOM
printf "%s" \"$SCRIPT\" > \"{func_file}\"
__ENTRYPOINT__ \"{func_file}\""""
)
Member Author

For runtimes with CustomTrainer, I insert the complete command when the Runtime object is created:

# Set the Trainer entrypoint.
if framework == types.TORCH_TUNE:
trainer.set_command(constants.TORCHTUNE_COMMAND)
elif ml_policy.torch:
trainer.set_command(constants.TORCH_COMMAND)
elif ml_policy.mpi:
trainer.set_command(constants.MPI_COMMAND)
else:
trainer.set_command(constants.DEFAULT_COMMAND)

I think this is a much clearer approach, since we always have the final container command in runtime.trainer.__command.
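Putting the template and the entrypoint together, the rendered script for a trivial function would look roughly like this (the function body, file path, and `python` entrypoint are illustrative values, not the SDK's real defaults):

```python
import textwrap

# Same template as in the diff above.
EXEC_FUNC_SCRIPT = textwrap.dedent(
    """
    read -r -d '' SCRIPT << EOM\n
    {func_code}
    EOM
    printf "%s" \"$SCRIPT\" > \"{func_file}\"
    __ENTRYPOINT__ \"{func_file}\""""
)

# Render with illustrative values; the real SDK derives these elsewhere.
rendered = EXEC_FUNC_SCRIPT.format(
    func_code='print("hello")',
    func_file="/tmp/func.py",
).replace("__ENTRYPOINT__", "python")

print(rendered)
```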

# The dict where key is the container image and value its representation.
# Each Trainer representation defines trainer parameters (e.g. type, framework, entrypoint).
# TODO (andreyvelich): We should allow users to override the default image names.
ALL_TRAINERS: Dict[str, RuntimeTrainer] = {
Contributor

Would keeping the existing mapping, but with the framework as key instead of the image, make things tidier / more structured rather than replacing that with individual strings?

Member Author

@andreyvelich andreyvelich Aug 1, 2025

But we don't need it, do we? Theoretically, users should be able to use any arbitrary framework they define in the runtime labels with CustomTrainer, even ones we don't support upstream (e.g. scikit-learn, tf, etc.).
The important mapping is only for BuiltinTrainers, since we need to map the framework name and BuiltinTrainer configs:

BuiltinTrainer.__annotations__["config"].__name__.lower().replace("config", "")

If you want, we can create a mapping for BuiltinTrainer configs, but for now we have only one config (TorchTune).
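If such a mapping were added later, it could look roughly like this (a sketch; `BUILTIN_CONFIGS`, the helper, and the config field are hypothetical — only the name-derivation expression comes from the thread):

```python
from dataclasses import dataclass


@dataclass
class TorchTuneConfig:
    """Placeholder for the real TorchTuneConfig."""
    dtype: str = "bf16"  # illustrative field


def framework_from_config(config_cls: type) -> str:
    # Derive the framework name from the config class name,
    # mirroring the expression quoted above.
    return config_cls.__name__.lower().replace("config", "")


# Hypothetical mapping; today there is only one builtin config.
BUILTIN_CONFIGS: dict[str, type] = {
    framework_from_config(cls): cls for cls in (TorchTuneConfig,)
}

print(list(BUILTIN_CONFIGS))  # ['torchtune']
```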

Contributor

OK that makes sense. It's more an internal implementation detail anyway.

Member

We should verify that:

  1. CustomTrainer does not reference runtimes that correspond to a BuiltinTrainer
  2. BuiltinTrainer only references its corresponding runtimes
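A minimal sketch of that check (the types, field names, and error message are assumptions, not the SDK's actual API):

```python
from dataclasses import dataclass
from enum import Enum


class TrainerType(Enum):
    CUSTOM = "custom"
    BUILTIN = "builtin"


@dataclass
class Runtime:
    name: str
    trainer_type: TrainerType  # assumed to be derived from the runtime labels


def validate_trainer(runtime: Runtime, trainer_type: TrainerType) -> None:
    # Reject a CustomTrainer paired with a builtin runtime, and vice versa.
    if runtime.trainer_type != trainer_type:
        raise ValueError(
            f"Runtime '{runtime.name}' expects a {runtime.trainer_type.value} "
            f"trainer, but got a {trainer_type.value} trainer."
        )


# Matching pair: no exception is raised.
validate_trainer(Runtime("torch-distributed", TrainerType.CUSTOM), TrainerType.CUSTOM)
```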

Member

@Electronic-Waste Electronic-Waste left a comment

@andreyvelich Thanks for creating this. I left my initial review for you.

Comment on lines +57 to +59
# The label key to identify the ML framework that the runtime uses (e.g. torch, deepspeed, torchtune, etc.)
RUNTIME_FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

Member

Shall we also define TRAINER_TYPE_LABEL?

Member Author

No, we don't really need it as we discussed here: kubeflow/trainer#2761 (comment)
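For context, the framework label above would be consumed roughly like this when inspecting a runtime (the helper and the "torch" fallback are assumptions, not the SDK's actual code):

```python
# Label key from the constants file quoted above.
RUNTIME_FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


def get_framework(runtime_labels: dict[str, str]) -> str:
    # Fall back to "torch" when a runtime carries no framework label (assumption).
    return runtime_labels.get(RUNTIME_FRAMEWORK_LABEL, "torch")


print(get_framework({"trainer.kubeflow.org/framework": "torchtune"}))  # torchtune
print(get_framework({}))  # torch
```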

func: Callable
func_args: Optional[Dict] = None
packages_to_install: Optional[List[str]] = None
packages_to_install: Optional[list[str]] = None
Member

Why should we change from List to list?
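For background (not stated in the thread): since Python 3.9, builtin collection types support subscripting directly (PEP 585), so the typing.List alias is redundant:

```python
from typing import List, get_origin

# Pre-3.9 style required the typing alias:
old_style: List[str] = ["torch"]

# Python 3.9+ can subscript the builtin directly (PEP 585):
new_style: list[str] = ["torch"]

# Both annotations resolve to the same runtime origin.
print(get_origin(List[str]) is list, get_origin(list[str]) is list)  # True True
```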


# Change it to list: BUILTIN_CONFIGS, once we support more Builtin Trainer configs.
TORCH_TUNE = (
Member

Do you prefer TORCHTUNE or TORCH_TUNE?

Member Author

Maybe TORCH_TUNE is more accurate since we use camel case here: TorchTuneConfig.

Comment on lines 195 to 211
if isinstance(trainer, types.CustomTrainer):
trainer_crd = utils.get_trainer_crd_from_custom_trainer(
trainer, runtime
runtime, trainer
Member

Can you add validation here to ensure that users reference the right runtime, as we discussed here: https://cloud-native.slack.com/archives/C0742LDFZ4K/p1753710956860929

Comment on lines 201 to 221
elif isinstance(trainer, types.BuiltinTrainer):
trainer_crd = utils.get_trainer_crd_from_builtin_trainer(
trainer, initializer
runtime, trainer, initializer
Member

Same as above

Member Author

Sure, thanks for the reminder!

@astefanutti
Contributor

astefanutti commented Aug 1, 2025

/lgtm

Thanks!

@google-oss-prow google-oss-prow bot added lgtm and removed lgtm labels Aug 1, 2025
@coveralls

coveralls commented Aug 1, 2025

Pull Request Test Coverage Report for Build 16706738665

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 48 of 84 (57.14%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.4%) to 66.038%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
python/kubeflow/trainer/api/trainer_client.py | 6 | 11 | 54.55%
python/kubeflow/trainer/utils/utils.py | 42 | 73 | 57.53%

Files with Coverage Reduction | New Missed Lines | %
python/kubeflow/trainer/utils/utils.py | 1 | 58.72%

Totals Coverage Status
Change from base Build 16677657237: 0.4%
Covered Lines: 280
Relevant Lines: 424

💛 - Coveralls

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich andreyvelich force-pushed the runtime-framework-labels branch from d6603e6 to 033357e Compare August 3, 2025 15:46
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

/retest

Contributor

@kramaranya kramaranya left a comment

This looks great, thank you!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Aug 4, 2025
Member Author

@andreyvelich andreyvelich left a comment

/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 09a5ea7 into kubeflow:main Aug 4, 2025
8 of 12 checks passed
@andreyvelich andreyvelich deleted the runtime-framework-labels branch August 4, 2025 15:15
@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Aug 4, 2025
