[SDK] Show available Runtime accelerators to users #21

@andreyvelich

What you would like to be added?

As part of the initial Kubeflow Training V2 SDK implementation, we introduced a label that identifies which accelerators a Training Runtime uses: kubeflow/trainer#2324 (comment).

If a Training Runtime has the training.kubeflow.org/accelerator: GPU-Tesla-V100-16GB label, we add this value to the Runtime class. Additionally, we get the number of CPU, GPU, or TPU devices from the TrainJob's containers and insert this value into the TrainJob's Component class.
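
To make the behavior above concrete, here is a minimal sketch that works on plain dictionaries shaped like Kubernetes manifests. It is not the SDK's actual code: the helper names `get_accelerator` and `get_device_count`, the resource keys checked, and the choice to read `resources.limits` are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Label key taken from the issue text.
ACCELERATOR_LABEL = "training.kubeflow.org/accelerator"

# Assumption: devices are requested through these resource names.
# Accelerator keys are checked first, with CPU as the fallback.
DEVICE_RESOURCE_KEYS = ("nvidia.com/gpu", "google.com/tpu", "cpu")


def get_accelerator(runtime: dict) -> Optional[str]:
    """Return the accelerator label value of a runtime manifest, if set."""
    labels = (runtime.get("metadata") or {}).get("labels") or {}
    return labels.get(ACCELERATOR_LABEL)


def get_device_count(container: dict) -> Optional[Tuple[str, str]]:
    """Return (resource name, quantity) for the first known device key
    found in the container's resource limits, or None if there is none."""
    limits = (container.get("resources") or {}).get("limits") or {}
    for key in DEVICE_RESOURCE_KEYS:
        if key in limits:
            return key, str(limits[key])
    return None


# Hypothetical manifests, used only to exercise the helpers.
runtime = {"metadata": {"labels": {ACCELERATOR_LABEL: "GPU-Tesla-V100-16GB"}}}
container = {"resources": {"limits": {"cpu": "4", "nvidia.com/gpu": 2}}}

print(get_accelerator(runtime))
print(get_device_count(container))
```

Note that the label value is free-form text, so nothing in this sketch verifies that the labeled accelerator matches the devices the containers actually request.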

However, this approach conflicts with other Kubernetes primitives (e.g. nodeSelectors and tolerations) and with Kueue configurations such as resource flavors.
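
One way to picture the conflict is to collect every hardware hint from a single runtime and see that they come from independent sources. The sketch below is purely illustrative: `collect_hardware_hints` is a hypothetical helper, and the node label key `example.com/gpu-model` and its value are made up for the example.

```python
ACCELERATOR_LABEL = "training.kubeflow.org/accelerator"


def collect_hardware_hints(runtime_labels: dict, pod_spec: dict) -> dict:
    """Gather the places in a runtime that say something about hardware.

    Nothing ties these fields together, so they can disagree.
    """
    tolerations = pod_spec.get("tolerations") or []
    return {
        "accelerator_label": runtime_labels.get(ACCELERATOR_LABEL),
        "node_selector": dict(pod_spec.get("nodeSelector") or {}),
        "toleration_keys": [t.get("key") for t in tolerations],
    }


# Hypothetical runtime: the label advertises a V100, while the
# nodeSelector (made-up key) steers Pods to a different GPU model.
hints = collect_hardware_hints(
    {ACCELERATOR_LABEL: "GPU-Tesla-V100-16GB"},
    {
        "nodeSelector": {"example.com/gpu-model": "a100"},
        "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists"}],
    },
)

for source, value in hints.items():
    print(f"{source}: {value}")
```

In this made-up case the SDK would report a V100 to the user even though scheduling is driven by the nodeSelector, which is the kind of mismatch this issue raises.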

We should discuss the right way to show users which hardware resources are available when they use Training Runtimes.

cc @franciscojavierarceo @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @seanlaii @kannon92

Why is this needed?

Data Scientists and ML Engineers should understand which accelerators are available to them when using Training Runtimes.

In the future, we could potentially use these values to automatically assign model and data tensors to the appropriate hardware devices while using the Kubeflow Training SDK.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
