Description
What would you like to be added?
As part of the initial Kubeflow Training V2 SDK implementation, we introduced a label that identifies which accelerators a Training Runtime uses: kubeflow/trainer#2324 (comment).
If a Training Runtime has the training.kubeflow.org/accelerator: GPU-Tesla-V100-16GB label, we add this value to the Runtime class. Additionally, we read the number of CPU, GPU, or TPU devices from the TrainJob's containers and insert this value into the TrainJob's Component class. A minimal sketch of this flow is shown below.
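Purely for illustration, here is a minimal Python sketch of that flow. The simplified `Runtime` and `Component` classes and the `runtime_from_labels` helper are assumptions for this example, not the exact SDK types:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Label key introduced for Training Runtimes (from the referenced proposal).
ACCELERATOR_LABEL = "training.kubeflow.org/accelerator"


@dataclass
class Component:
    """Simplified view of a TrainJob component and its device counts."""

    name: str
    # e.g. {"cpu": 4, "nvidia.com/gpu": 8}, collected from the container resources.
    device_counts: Dict[str, int] = field(default_factory=dict)


@dataclass
class Runtime:
    """Simplified view of a Training Runtime exposed by the SDK."""

    name: str
    # Value of the training.kubeflow.org/accelerator label, if present.
    accelerator: Optional[str] = None


def runtime_from_labels(name: str, labels: Dict[str, str]) -> Runtime:
    """Populate the accelerator field from the runtime's metadata labels."""
    return Runtime(name=name, accelerator=labels.get(ACCELERATOR_LABEL))


# Example: a runtime labeled with a V100 accelerator.
runtime = runtime_from_labels(
    "torch-distributed",
    {ACCELERATOR_LABEL: "GPU-Tesla-V100-16GB"},
)
print(runtime.accelerator)  # -> GPU-Tesla-V100-16GB
```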
However, this label conflicts with other Kubernetes primitives (e.g. nodeSelectors, tolerations) and with Kueue configurations such as resource flavors.
We should discuss the right way to surface the available hardware resources to users when they work with Training Runtimes.
cc @franciscojavierarceo @kubeflow/wg-training-leads @astefanutti @Electronic-Waste @seanlaii @kannon92
Why is this needed?
Data Scientists and ML Engineers should understand which accelerators are available to them when using Training Runtimes.
In the future, we could potentially use these values to automatically assign model and data tensors to the appropriate hardware devices when using the Kubeflow Training SDK, as roughly sketched below.
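A purely hypothetical sketch of that direction, assuming PyTorch and the accelerator label value read from the Training Runtime (the `pick_device` helper is an assumption, not an SDK API):

```python
from typing import Optional

import torch


def pick_device(accelerator: Optional[str]) -> torch.device:
    """Map a training.kubeflow.org/accelerator label value to a torch device."""
    if accelerator and accelerator.startswith("GPU") and torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


# e.g. the label value read from the Training Runtime.
device = pick_device("GPU-Tesla-V100-16GB")
model = torch.nn.Linear(128, 10).to(device)
batch = torch.randn(32, 128, device=device)
output = model(batch)
```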
Love this feature?
Give it a 👍 We prioritize the features with the most 👍