Specifying Instance Type Instead Of Resource Requests #6376
Replies: 2 comments
-
We do something very similar internally (running on GKE). Our ML engineers only select from instance types for which we on the platform team have created node pools; they don't specify requests/limits etc. themselves. (We have an internal wrapper decorator around `@task` for this; see the sketch below.) I very much agree with your point that maintaining cloud-provider-specific machine shapes within flytekit itself is probably not a good idea, but that having an interface in flytekit that platform teams can make use of would be very beneficial. I could also see the community maintaining cloud-provider-specific machine shapes in flytekit plugins.
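As an illustration of the pattern described above, here is a minimal sketch of such a wrapper decorator. The preset names, resource numbers, and the `platform_task` helper are all hypothetical, not the actual internal code:

```python
from flytekit import Resources, task

# Hypothetical presets: the only machine shapes the platform team has
# created node pools for. Names and numbers are illustrative.
INSTANCE_PRESETS = {
    "n2-standard-4": Resources(cpu="3", mem="13Gi"),
    "a100-x1": Resources(cpu="11", mem="80Gi", gpu="1"),
}


def platform_task(instance_type: str, **task_kwargs):
    """Wrapper around @task: engineers pick an instance type by name,
    and the platform translates it into requests/limits."""
    resources = INSTANCE_PRESETS[instance_type]

    def decorator(fn):
        # Request == limit so the pod fills exactly one node-pool slot.
        return task(requests=resources, limits=resources, **task_kwargs)(fn)

    return decorator


@platform_task(instance_type="n2-standard-4")
def train_model() -> None:
    ...
```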
-
Hey @swarup-stripe, thanks for laying out this idea. We discussed this in the contributors' meetup, considering the experience of @fg91 and other organizations.

Fabio mentioned that in their approach, they rely heavily on the node autoscaling capabilities of GKE. Since they don't want idle GPU nodes, they scale nodes up on demand when a job needs the resources and quickly scale them down when they are no longer needed.

This contrasts with the approach @EngHabu described at another company, where they have a more proactive strategy: they analyze job history to determine the optimal machine types to provision, and then use Karpenter to pack as many jobs as possible onto those pre-provisioned nodes. He noted this proposal could complicate that approach a bit.

The key difference is reactive vs. proactive autoscaling. Fabio's company scales on demand, while the other company tries to predict the optimal machine types ahead of time and then packs jobs onto those pre-provisioned nodes. The proactive approach gives users less flexibility to choose machine types, but may lead to better overall resource utilization if the predictions are accurate. The reactive approach maintains more flexibility, but risks idle resources if the autoscaling can't keep up with demand.

So the choice between these approaches depends on factors like the predictability of the workload, the importance of user flexibility, and the overall goal of maximizing resource efficiency. How does this work at your organization, @swarup-stripe?
-
Motivation
Currently, users are completely unaware of which instance types are available in the cluster Flyte is deployed to. Obliviously, they specify `cpu`, `memory`, `gpu`, and `ephemeral_storage` requests and limits in their task definitions and get assigned to any matching instance type. However, this creates a lot of variability and ambiguity as to which instance a task will be scheduled on, which in turn causes a number of issues.

Workarounds
We currently have a wrapper class to define each instance type that our cluster supports:
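A minimal sketch of what such a wrapper might look like, assuming flytekit's `Resources` API (the class name, fields, and sizing are illustrative, not our actual internal code):

```python
from dataclasses import dataclass

from flytekit import Resources


@dataclass(frozen=True)
class InstanceType:
    """One instance type the cluster supports, sized slightly below the
    machine's capacity to leave headroom for system daemons."""

    cpu: float  # cores
    mem_gib: float  # GiB
    gpu: int = 0
    ephemeral_storage_gib: float = 0

    @property
    def flyte_resource(self) -> Resources:
        # Turn the preset back into the raw Resources object that
        # @task(requests=...) expects.
        return Resources(
            cpu=str(self.cpu),
            mem=f"{self.mem_gib}Gi",
            gpu=str(self.gpu) if self.gpu else None,
            ephemeral_storage=(
                f"{self.ephemeral_storage_gib}Gi"
                if self.ephemeral_storage_gib
                else None
            ),
        )

    def __truediv__(self, divisor: int) -> "InstanceType":
        # DEFAULT / 2 yields a preset half the size, so several tasks
        # can share one node.
        return InstanceType(
            cpu=self.cpu / divisor,
            mem_gib=self.mem_gib / divisor,
            gpu=self.gpu // divisor,
            ephemeral_storage_gib=self.ephemeral_storage_gib / divisor,
        )
```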
with some presets defined (note that we're on AWS):
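Hypothetically, in the same spirit (values are illustrative, sized slightly below the nominal machine capacity; actual allocatable capacity varies):

```python
# AWS m5d.xlarge: 4 vCPU, 16 GiB RAM, 150 GB NVMe (nominal).
M5D_XLARGE = InstanceType(cpu=3.5, mem_gib=14, ephemeral_storage_gib=100)

# AWS m5d.4xlarge: 16 vCPU, 64 GiB RAM, 2 x 300 GB NVMe (nominal).
M5D_4XLARGE = InstanceType(cpu=15, mem_gib=60, ephemeral_storage_gib=500)

DEFAULT = M5D_XLARGE
```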
And this class has some nice features, like implementing `__truediv__` so the presets can be divided. But to use it we have to turn it back into the raw resources, e.g. `@task(requests=DEFAULT.flyte_resource)`, as shown below.
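Continuing the hypothetical sketch above, usage looks like this:

```python
from flytekit import task

# Full node for training; half a node for lighter preprocessing.
@task(requests=DEFAULT.flyte_resource)
def train() -> None:
    ...


@task(requests=(DEFAULT / 2).flyte_resource)
def preprocess() -> None:
    ...
```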
We would also like to abstract out `pod_template` in this - we heavily use node selectors, taints and tolerations to select instance families, and it would be nice to include those in the presets, as sketched below.
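A sketch of how node selectors and tolerations could fold into such presets, using flytekit's `PodTemplate` and the Kubernetes client models (the label and taint keys are made up):

```python
from flytekit import PodTemplate, task
from kubernetes.client import V1PodSpec, V1Toleration

# Hypothetical: pin the m5d instance family via a node selector plus a
# matching toleration for the node pool's taint.
M5D_POD_TEMPLATE = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[],
        node_selector={"example.com/instance-family": "m5d"},
        tolerations=[
            V1Toleration(
                key="example.com/instance-family",
                operator="Equal",
                value="m5d",
                effect="NoSchedule",
            )
        ],
    )
)


@task(requests=M5D_XLARGE.flyte_resource, pod_template=M5D_POD_TEMPLATE)
def train() -> None:
    ...
```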
Can we have a class like `Resources` baked into flytekit, so we can do something like `@task(resources=M5D_XLARGE)`, and have it abstract out things like `requests`, `limits`, and `pod_template` while remaining overridable? We don't need to be so prescriptive as to define presets in flytekit itself, since those are cloud-provider specific, but having an interface like this exposed to platform owners would be incredibly useful.
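A rough sketch of the kind of interface this proposal asks for (nothing like this exists in flytekit today; all names are hypothetical):

```python
from dataclasses import dataclass, replace
from typing import Optional

from flytekit import PodTemplate, Resources


@dataclass(frozen=True)
class MachinePreset:
    """Hypothetical interface flytekit could expose: platform teams
    instantiate it per instance type; flytekit ships no presets."""

    requests: Resources
    limits: Optional[Resources] = None
    pod_template: Optional[PodTemplate] = None

    def override(self, **changes: object) -> "MachinePreset":
        # Let users tweak a preset without redefining it.
        return replace(self, **changes)


# Platform-owned preset, not part of flytekit itself.
M5D_XLARGE = MachinePreset(
    requests=Resources(cpu="3.5", mem="14Gi"),
    pod_template=M5D_POD_TEMPLATE,  # node-selector sketch from above
)

# Proposed usage (not supported by @task today):
# @task(resources=M5D_XLARGE)
# def train() -> None: ...
```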