Specifying Instance Type Instead Of Resource Requests #6376
Replies: 2 comments
-
We do something very similar internally (running on GKE). Our ML engineers only select from instance types for which we on the platform team have created node pools; they don't specify requests/limits etc. themselves. (We have an internal wrapper decorator around `@task` for this; see the sketch below.) I very much agree with your point that maintaining cloud-provider-specific machine shapes within flytekit itself is probably not a good idea, but that having an interface in flytekit that platform teams can make use of would be very beneficial. I could also see the community maintaining cloud-provider-specific machine shapes in flytekit plugins.
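As an illustration of the pattern described above, here is a minimal sketch of such a wrapper decorator. The preset names, resource numbers, and the `platform_task` helper are all hypothetical, not the actual internal code:

```python
from flytekit import Resources, task

# Hypothetical presets: the only machine shapes the platform team has
# created node pools for. Names and numbers are illustrative.
INSTANCE_PRESETS = {
    "n2-standard-4": Resources(cpu="3", mem="13Gi"),
    "a100-x1": Resources(cpu="11", mem="80Gi", gpu="1"),
}


def platform_task(instance_type: str, **task_kwargs):
    """Wrapper around @task: engineers pick an instance type by name,
    and the platform translates it into requests/limits."""
    resources = INSTANCE_PRESETS[instance_type]

    def decorator(fn):
        # Request == limit so the pod fills exactly one node-pool slot.
        return task(requests=resources, limits=resources, **task_kwargs)(fn)

    return decorator


@platform_task(instance_type="n2-standard-4")
def train_model() -> None:
    ...
```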
-
Hey @swarup-stripe, thanks for laying out this idea. We discussed this in the contributors' meetup, considering the experience of @fg91 and other organizations.

Fabio mentioned that in their approach, they rely heavily on the node autoscaling capabilities of GKE. Since they don't want idle GPU nodes, they scale nodes up on demand when a job needs the resources and quickly scale them down when they are no longer needed.

This contrasts with the approach @EngHabu described at another company, where they have a more proactive strategy: they analyze job history to determine the optimal machine types to provision, and then use Karpenter to pack as many jobs as possible onto those pre-provisioned nodes. He noted this proposal could complicate that approach a bit.

The key difference is reactive vs. proactive autoscaling. Fabio's company scales on demand, while the other company tries to predict the optimal machine types ahead of time and then packs jobs onto those pre-provisioned nodes. The proactive approach gives users less flexibility to choose machine types, but may lead to better overall resource utilization if the predictions are accurate. The reactive approach maintains more flexibility, but risks idle resources if the autoscaling can't keep up with demand.

So the choice between these approaches depends on factors like the predictability of the workload, the importance of user flexibility, and the overall goal of maximizing resource efficiency. How does this work at your organization, @swarup-stripe?
-
Motivation
Currently, users are completely unaware of which instance types are available in the cluster Flyte is deployed to. Obliviously, they specify `cpu`, `memory`, `gpu`, and `ephemeral_storage` requests and limits in their task definitions and get assigned to any matching instance type. However, this creates a lot of variability and ambiguity as to which instance a task will be scheduled on, which in turn causes a number of issues.

Workarounds
We currently have a wrapper class to define each instance type that our cluster supports:
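A minimal sketch of what such a wrapper might look like, assuming flytekit's `Resources` API (the class name, fields, and sizing are illustrative, not our actual internal code):

```python
from dataclasses import dataclass

from flytekit import Resources


@dataclass(frozen=True)
class InstanceType:
    """One instance type the cluster supports, sized slightly below the
    machine's capacity to leave headroom for system daemons."""

    cpu: float  # cores
    mem_gib: float  # GiB
    gpu: int = 0
    ephemeral_storage_gib: float = 0

    @property
    def flyte_resource(self) -> Resources:
        # Turn the preset back into the raw Resources object that
        # @task(requests=...) expects.
        return Resources(
            cpu=str(self.cpu),
            mem=f"{self.mem_gib}Gi",
            gpu=str(self.gpu) if self.gpu else None,
            ephemeral_storage=(
                f"{self.ephemeral_storage_gib}Gi"
                if self.ephemeral_storage_gib
                else None
            ),
        )

    def __truediv__(self, divisor: int) -> "InstanceType":
        # DEFAULT / 2 yields a preset half the size, so several tasks
        # can share one node.
        return InstanceType(
            cpu=self.cpu / divisor,
            mem_gib=self.mem_gib / divisor,
            gpu=self.gpu // divisor,
            ephemeral_storage_gib=self.ephemeral_storage_gib / divisor,
        )
```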
with some presets defined (note that we're on AWS):
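Hypothetically, in the same spirit (values are illustrative, sized slightly below the nominal machine capacity; actual allocatable capacity varies):

```python
# AWS m5d.xlarge: 4 vCPU, 16 GiB RAM, 150 GB NVMe (nominal).
M5D_XLARGE = InstanceType(cpu=3.5, mem_gib=14, ephemeral_storage_gib=100)

# AWS m5d.4xlarge: 16 vCPU, 64 GiB RAM, 2 x 300 GB NVMe (nominal).
M5D_4XLARGE = InstanceType(cpu=15, mem_gib=60, ephemeral_storage_gib=500)

DEFAULT = M5D_XLARGE
```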
And this class has some nice features, like implementing `__truediv__` so the presets can be divided. But to use it we have to turn it back into the raw resources, e.g. `@task(requests=DEFAULT.flyte_resource)`, as shown below.
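Continuing the hypothetical sketch above, usage looks like this:

```python
from flytekit import task

# Full node for training; half a node for lighter preprocessing.
@task(requests=DEFAULT.flyte_resource)
def train() -> None:
    ...


@task(requests=(DEFAULT / 2).flyte_resource)
def preprocess() -> None:
    ...
```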
We would also like to abstract out `pod_template` in this - we heavily use node selectors, taints and tolerations to select instance families, and it would be nice to include those in the presets, as sketched below.
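A sketch of how node selectors and tolerations could fold into such presets, using flytekit's `PodTemplate` and the Kubernetes client models (the label and taint keys are made up):

```python
from flytekit import PodTemplate, task
from kubernetes.client import V1PodSpec, V1Toleration

# Hypothetical: pin the m5d instance family via a node selector plus a
# matching toleration for the node pool's taint.
M5D_POD_TEMPLATE = PodTemplate(
    pod_spec=V1PodSpec(
        containers=[],
        node_selector={"example.com/instance-family": "m5d"},
        tolerations=[
            V1Toleration(
                key="example.com/instance-family",
                operator="Equal",
                value="m5d",
                effect="NoSchedule",
            )
        ],
    )
)


@task(requests=M5D_XLARGE.flyte_resource, pod_template=M5D_POD_TEMPLATE)
def train() -> None:
    ...
```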
Can we have a class like `Resources` baked into flytekit, so we can do something like `@task(resources=M5D_XLARGE)`, and have it abstract out things like `requests`, `limits`, and `pod_template` while remaining overridable? We don't need to be so prescriptive as to define presets in flytekit itself, since those are cloud-provider specific, but having an interface like this exposed to platform owners would be incredibly useful.
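A rough sketch of the kind of interface this proposal asks for (nothing like this exists in flytekit today; all names are hypothetical):

```python
from dataclasses import dataclass, replace
from typing import Optional

from flytekit import PodTemplate, Resources


@dataclass(frozen=True)
class MachinePreset:
    """Hypothetical interface flytekit could expose: platform teams
    instantiate it per instance type; flytekit ships no presets."""

    requests: Resources
    limits: Optional[Resources] = None
    pod_template: Optional[PodTemplate] = None

    def override(self, **changes: object) -> "MachinePreset":
        # Let users tweak a preset without redefining it.
        return replace(self, **changes)


# Platform-owned preset, not part of flytekit itself.
M5D_XLARGE = MachinePreset(
    requests=Resources(cpu="3.5", mem="14Gi"),
    pod_template=M5D_POD_TEMPLATE,  # node-selector sketch from above
)

# Proposed usage (not supported by @task today):
# @task(resources=M5D_XLARGE)
# def train() -> None: ...
```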