-
We're currently exploring the use of admissionFairSharing:

```yaml
usageHalfLifeTime: "168h"     # decay over 1 week
usageSamplingInterval: "5m"   # sampled every 5 minutes
resourceWeights:
  cpu: 1.0
  memory: 1.0
  nvidia.com/gpu: 2.0
```

However, we're concerned about possible workload behaviors that might undermine fair sharing without triggering preemption, such as long-running workloads that monopolize resources. Are there best practices to mitigate these patterns without enabling preemption?
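For context, a minimal sketch of where this stanza sits in the Kueue manager configuration, assuming the v1beta1 Configuration API (treat this as illustrative, not a complete config):

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
# ... other controller settings elided ...
admissionFairSharing:
  usageHalfLifeTime: "168h"     # decay over 1 week
  usageSamplingInterval: "5m"   # sampled every 5 minutes
  resourceWeights:
    cpu: 1.0
    memory: 1.0
    nvidia.com/gpu: 2.0         # GPUs count double toward usage
```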
-
As a follow-up, are there recommended strategies within Kueue to mitigate these behaviors?
-
Here is a snapshot of my knowledge:
1. There is no such additional penalty AFAIK.
2. I don't think we currently have any additional mechanism other than lowering the sampling interval.
3. I don't think so. Please also note that we are changing the ordering of workloads for preemption in 0.13: #5632
4. Admission Fair Sharing is still a new feature, and I don't think such "best practices" exist at the moment. One may consider a separate CQ for CPU-heavy and GPU-heavy jobs (see the sketch below). However, the best person to address these questions would be @PBundyra, but he is on vacation until July 21st. Maybe in the meantime @mwielgus or @mwysokin could share some extra knowledge here.
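A minimal sketch of the "separate CQ per job profile" idea. The queue names, flavor names, and quotas are placeholders (and the ResourceFlavors are assumed to exist already):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cpu-jobs                # hypothetical name
spec:
  namespaceSelector: {}         # match all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: default-flavor      # assumes this ResourceFlavor exists
      resources:
      - name: cpu
        nominalQuota: 100       # placeholder quota
      - name: memory
        nominalQuota: 400Gi     # placeholder quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-jobs                # hypothetical name
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: gpu-flavor          # assumes this ResourceFlavor exists
      resources:
      - name: cpu
        nominalQuota: 50        # placeholder quota
      - name: memory
        nominalQuota: 200Gi     # placeholder quota
      - name: nvidia.com/gpu
        nominalQuota: 16        # placeholder quota
```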
-
Hi @kimminw00, thanks for reaching out.
There is no such penalty, but please note that even if a workload runs well past the half-life window, its accounted usage stays bounded by what it actually consumes. E.g. there's a long-running workload that consumes 100 GPUs: even if it runs for a long period of time, the usage in the LocalQueue's status won't surpass 100 GPUs.
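To make the decay concrete, assuming usage samples decay exponentially with the configured half-life (a simplified reading of the semantics, so treat this as illustrative), a sample of age $t$ contributes

$$u(t) = u_0 \cdot 2^{-t/T_{1/2}}, \qquad T_{1/2} = 168\,\mathrm{h},$$

so a sample from one week ago counts at half weight, one from two weeks ago at a quarter, and so on; accumulated usage therefore saturates instead of growing without bound.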
In Kueue v0.13.0 we introduced an entry penalty for every admitted job, so even if a job runs for less than the sampling interval, some usage will be accounted. At the same time, as @mimowo has said, lowering the sampling interval to 1 min shouldn't be much of a problem, as it's just an extra reconcile every minute per LocalQueue, not per job. More about the entry penalty: https://kueue.sigs.k8s.io/docs/concepts/admission_fair_sharing/#entry-penalty
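For reference, the tweak being discussed is just a change to the stanza quoted at the top of the thread, e.g.:

```yaml
admissionFairSharing:
  usageHalfLifeTime: "168h"
  usageSamplingInterval: "1m"   # lowered from 5m, per the discussion above
```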
-
You could also dedicate a separate LQ to long-running jobs and balance the weight of its usage. E.g. you could decrease the importance of the usage for the dedicated LocalQueue by a factor of 10 via its fair-sharing weight (sketch below).
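A minimal sketch of that setup, assuming the LocalQueue-level fairSharing weight described in the AFS docs; the names and the weight value are placeholders, and it's worth double-checking in the docs which direction the weight scales usage:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: long-running            # hypothetical dedicated LQ
  namespace: team-a             # placeholder namespace
spec:
  clusterQueue: shared-queue    # placeholder CQ name
  fairSharing:
    weight: "10"                # placeholder; pick the value that de-emphasizes usage 10x per the AFS docs
```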
-
In scenarios where all workloads have the same priority and use a single shared resource pool (a single ClusterQueue), the existing preemption logic may not function effectively: there is no borrowing, and no priority differences (intentionally, to prevent users from inflating their workloads' priority to gain resource access) to trigger preemption. To address this, how about implementing a new preemption strategy that targets workloads exceeding a predefined resource-usage limit (or a maximum execution time)? This would ensure fair resource distribution and prevent prolonged jobs from monopolizing resources, even when priorities are equal and nominal quota overuse is undefined (illustrative sketch below).
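Purely to illustrate the proposal, a sketch of what such a policy could look like on a ClusterQueue. These fields are hypothetical and do not exist in Kueue's API today:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: shared-queue                # placeholder name
spec:
  preemption:
    # hypothetical strategy proposed above, NOT part of Kueue:
    usageBased:
      maxUsageShare: "0.5"          # preempt workloads from LQs above 50% of the pool
      maxExecutionTime: "24h"       # or workloads running longer than this
```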
-
Thanks for the detailed explanation! I now understand that the recent work on preemption in AFS, particularly the new ordering policy (e.g., #5632), is focused on determining which workloads should be preempted, especially in the context of a single ClusterQueue (CQ) shared by multiple LocalQueues (LQs). However, my original intention was to ask not about "which" workloads should be preempted, but rather under "what conditions" preemption in AFS is triggered. From the Kueue documentation on preemption, it seems that preemption can be triggered in a few scenarios, such as borrowing or priority differences. But in a setup with a single CQ, where borrowing is not possible and all workloads have the same priority, it's unclear when, or if, preemption would actually occur. Could you clarify the specific conditions that trigger preemption in such a configuration?
Also opened issue #6493 to clarify that the algorithm for picking preemption targets relies on relative usage between LQs rather than priorities, which seems to be a source of confusion.