
Training Options for TrainJob customization #92

@kramaranya

What you would like to be added?

We propose an options-style API for the Kubeflow Trainer SDK that lets users customize a TrainJob without exposing the full Kubernetes API surface or requiring them to write YAML files.
The idea is to add an optional options parameter to train(), inspired by Go's functional options pattern: each option is a small callable that mutates the TrainJob before it is submitted.
This gives data scientists a simple way to handle common tasks such as adding labels and annotations, enabling Kueue, and attaching per-Pod template overrides:

from kubeflow.trainer import TrainerClient, CustomTrainer
from kubeflow.trainer.options import (
    WithLabels,
    WithAnnotations,
    WithKueue,
    WithPodSpecOverrides,
)

def MyCustomOption(service_account: str):
    # A user-defined option: any callable that takes the TrainJob (as a dict) and mutates it in place.
    def apply(job: dict):
        spec = job.setdefault("spec", {})
        spec.setdefault("podSpecOverrides", []).append({"serviceAccountName": service_account})
    return apply

# `runtime` and the training function `train` are assumed to be defined earlier.
job_id = TrainerClient().train(
    runtime=runtime,
    trainer=CustomTrainer(
        func=train,
        packages_to_install=["transformers", "boto3"],
    ),
    options=[
        WithLabels({"project": "training", "owner": "platform-team"}),
        WithAnnotations({"version": "v1"}),
        WithKueue(
            queue="ml-queue",
            topology_labels={"kueue.x-k8s.io/podset-name": "node"},
            managed_by="kueue.x-k8s.io/multikueue",
        ),
        WithPodSpecOverrides([
            {
                "targetJobs": ["node"],
                "labels": {"workload": "training"},
                "containers": [
                    {"name": "node", "volumeMounts": [{"name": "data", "mountPath": "/workspace/data"}]}
                ],
                "volumes": [
                    {"name": "data", "persistentVolumeClaim": {"claimName": "datasets-pvc"}}
                ],
            }
        ]),
        MyCustomOption("trainer-sa"),
    ],
)
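
To make the "small callable that mutates the TrainJob" idea concrete, here is a minimal sketch of how two of the proposed options might be written. It is not the SDK's implementation: it assumes the TrainJob is handled as a plain dict, that WithKueue maps its queue argument to the Kueue `kueue.x-k8s.io/queue-name` label and managed_by to `spec.managedBy`, and it leaves out topology_labels because the proposal does not say where that value should land.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical alias: an option is a callable that mutates the TrainJob dict in place.
Option = Callable[[Dict[str, Any]], None]


def WithLabels(labels: Dict[str, str]) -> Option:
    """Merge labels into metadata.labels (assumed implementation)."""
    def apply(job: Dict[str, Any]) -> None:
        job.setdefault("metadata", {}).setdefault("labels", {}).update(labels)
    return apply


def WithKueue(queue: str, managed_by: Optional[str] = None) -> Option:
    """Set the Kueue queue label and, optionally, spec.managedBy (assumed mapping)."""
    def apply(job: Dict[str, Any]) -> None:
        labels = job.setdefault("metadata", {}).setdefault("labels", {})
        labels["kueue.x-k8s.io/queue-name"] = queue
        if managed_by is not None:
            job.setdefault("spec", {})["managedBy"] = managed_by
    return apply


# A bare manifest as a plain dict, standing in for whatever train() builds internally.
job: Dict[str, Any] = {"kind": "TrainJob", "metadata": {"name": "demo"}, "spec": {}}

# Apply each option in list order, as train() would before submitting.
for option in [
    WithLabels({"project": "training"}),
    WithKueue(queue="ml-queue", managed_by="kueue.x-k8s.io/multikueue"),
]:
    option(job)

print(job["metadata"]["labels"])
print(job["spec"])
```

Because every option has the same signature, a user-defined option such as MyCustomOption above could be mixed into the same list without the SDK knowing about it in advance.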

Why is this needed?

This addresses current gaps in the SDK, such as setting labels and annotations, using Kueue, and mounting volumes via volumes/volumeMounts.
The change is backward compatible: if a user doesn’t pass options, behavior is unchanged.
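
The backward-compatibility claim can be illustrated with a small sketch. The function build_train_job and the option add_owner are hypothetical stand-ins, not SDK names; the only assumption is that options defaults to None and is treated the same as an empty list.

```python
from typing import Any, Callable, Dict, List, Optional

Option = Callable[[Dict[str, Any]], None]


def build_train_job(name: str, options: Optional[List[Option]] = None) -> Dict[str, Any]:
    """Hypothetical stand-in for the job-building step inside train()."""
    job: Dict[str, Any] = {"kind": "TrainJob", "metadata": {"name": name}, "spec": {}}
    # None and [] both mean "no customization": the loop body never runs.
    for option in options or []:
        option(job)
    return job


def add_owner(job: Dict[str, Any]) -> None:
    """Example option that adds a single owner label."""
    job["metadata"].setdefault("labels", {})["owner"] = "platform-team"


baseline = build_train_job("demo")                           # existing call style
same = build_train_job("demo", options=[])                   # explicit empty list
customized = build_train_job("demo", options=[add_owner])    # opt-in customization

print(baseline == same)
print(customized["metadata"])
```

Existing callers that never pass options would get the same manifest as before, and only callers who opt in see any change.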

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
