Description
What happened:
While working on the integration test for #4934 / #4935, I noticed a curious issue.
Workloads never get any conditions set, even though both the scheduler and the reconciler are actively processing them. The culprit appears to be a resource version conflict between the scheduler and the reconciler, such that the scheduler never applies this status update during failed admission of head entries:
kueue/pkg/scheduler/scheduler.go
Line 659 in 3279d9c
resulting in errors like:
2025-04-21T17:49:24.03698-04:00 ERROR scheduler scheduler/scheduler.go:685 Could not update Workload status {"schedulingCycle": 5, "error": "Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"admission-check-wl2\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).requeueAndUpdate
/Users/alexeldeib/code/kueue/pkg/scheduler/scheduler.go:685
sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule
/Users/alexeldeib/code/kueue/pkg/scheduler/scheduler.go:302
sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff.func1
/Users/alexeldeib/code/kueue/pkg/util/wait/backoff.go:43
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/Users/alexeldeib/code/kueue/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/Users/alexeldeib/code/kueue/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227
sigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff
/Users/alexeldeib/code/kueue/pkg/util/wait/backoff.go:42
sigs.k8s.io/kueue/pkg/util/wait.UntilWithBackoff
/Users/alexeldeib/code/kueue/pkg/util/wait/backoff.go:34
In this case there is no workload reconcile, but the workload is requeued as inadmissible and is never re-scheduled/nominated when it hits:
kueue/pkg/queue/cluster_queue.go
Line 406 in 3279d9c
OR the update passes, but the workload reconciler triggers a no-op update from pending to pending; see this code path:
kueue/pkg/controller/core/workload_controller.go
Lines 704 to 705 in 3279d9c
Either case ends up with the workload requeued as inadmissible, after which it may never be requeued again: nothing about a single inadmissible workload will retrigger scheduling on its own, unless other workloads are deleted, the ClusterQueues are updated, etc.
There are two potential fixes, and both seem to be required:
- Trigger a requeue of the inadmissible workload immediately on a resource version conflict (e.g. apierrors.IsConflict) during the requeue status update. This solves the first case: without an immediate requeue, and with no additional update coming from the workload controller, the workload is stuck.
- Trigger a requeue of inadmissible workloads during the pending -> pending reconcile in the workload controller, in addition to the default path (for spurious/uncached events). This handles the case where the status update succeeds and triggers a workload reconcile, but that reconcile does not currently retrigger a scheduling loop.
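The two fixes can be sketched roughly as below. All names and signatures here are illustrative stand-ins, not Kueue's actual code; in particular, isConflict plays the role of apierrors.IsConflict from k8s.io/apimachinery/pkg/api/errors, and the requeue callbacks stand in for pushing the workload back into its ClusterQueue:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict simulates the optimistic-concurrency error the API server
// returns when the reconciler races the scheduler's status update.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// isConflict stands in for apierrors.IsConflict (hypothetical local stub).
func isConflict(err error) bool { return errors.Is(err, errConflict) }

// Fix 1 (scheduler side): if the status update during requeueAndUpdate fails
// with a resource version conflict, requeue the workload immediately so a
// later scheduling cycle can retry writing the condition.
func requeueAndUpdate(updateStatus func() error, requeueImmediately func()) error {
	err := updateStatus()
	if err != nil && isConflict(err) {
		requeueImmediately()
	}
	return err
}

// Fix 2 (workload controller side): on a pending -> pending reconcile,
// requeue the workload instead of treating the transition as a pure no-op,
// so a status update that succeeded but raced still retriggers scheduling.
func reconcilePending(wasPending, isPending bool, queueInadmissible func()) {
	if wasPending && isPending {
		queueInadmissible()
	}
}

func main() {
	requeued := false
	err := requeueAndUpdate(
		func() error { return errConflict },
		func() { requeued = true },
	)
	fmt.Println("fix 1 requeued:", requeued, "err:", err != nil)

	retriggered := false
	reconcilePending(true, true, func() { retriggered = true })
	fmt.Println("fix 2 retriggered:", retriggered)
}
```

This only captures the control flow of the proposal, not where these hooks would actually live in scheduler.go and workload_controller.go.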
What you expected to happen:
Condition status updates should occur on pending workloads.
How to reproduce it (as minimally and precisely as possible):
See #4935: remove the changes mentioned above and run the test added in that PR a few times; it will reproduce both variations.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- Kueue version (use git describe --tags --dirty --always):
- Cloud provider or hardware configuration:
- OS (e.g: cat /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others: