
Commit 3429dd5

kudureranganath authored and Peter Zijlstra committed
sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an
entity is delayed irrespective of whether the entity corresponds to a
task or a cfs_rq.

Consider the following scenario:

	root
	/    \
      A      B (*) delayed since B is no longer eligible on root
      |      |
    Task0  Task1 <--- dequeue_task_fair() - task blocks

When Task1 blocks (dequeue_entity() for task's se returns true),
dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the
hierarchy of Task1. However, when the sched_entity corresponding to
cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the
hierarchy too, leading to both dequeue_entity() and set_delayed()
decrementing h_nr_runnable for the dequeue of the same task.

A SCHED_WARN_ON() to inspect h_nr_runnable post its update in
dequeue_entities() like below:

    cfs_rq->h_nr_runnable -= h_nr_runnable;
    SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);

is consistently tripped when running wakeup intensive workloads like
hackbench in a cgroup.

This error is self correcting since cfs_rq are per-cpu and cannot
migrate. The entity is either picked for full dequeue or is requeued
when a task wakes up below it. Both those paths call clear_delayed()
which again increments h_nr_runnable of the hierarchy without
considering if the entity corresponds to a task or not.

h_nr_runnable will eventually reflect the correct value; however, in the
interim, the incorrect values can still influence PELT calculation,
which uses se->runnable_weight or cfs_rq->h_nr_runnable.

Since only delayed tasks take the early return path in
dequeue_entities() and enqueue_task_fair(), adjust the h_nr_runnable in
{set,clear}_delayed() only when a task is delayed, as this path skips
the h_nr_* update loops and returns early. For entities corresponding to
a cfs_rq, the h_nr_* update loop in the caller will do the right thing.
Fixes: 76f2f78 ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
1 parent 95ec54a commit 3429dd5

1 file changed: 19 additions, 0 deletions


kernel/sched/fair.c

Lines changed: 19 additions & 0 deletions

@@ -5372,6 +5372,15 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void set_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 1;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since dequeue_entities()
+	 * will account it for blocked tasks.
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -5384,6 +5393,16 @@ static void set_delayed(struct sched_entity *se)
 static void clear_delayed(struct sched_entity *se)
 {
 	se->sched_delayed = 0;
+
+	/*
+	 * Delayed se of cfs_rq have no tasks queued on them.
+	 * Do not adjust h_nr_runnable since a dequeue has
+	 * already accounted for it or an enqueue of a task
+	 * below it will account for it in enqueue_task_fair().
+	 */
+	if (!entity_is_task(se))
+		return;
+
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
