Commit f60a631

vingu-linaro authored and Ingo Molnar committed
sched/fair: Fix tg->load when offlining a CPU
When a CPU is taken offline, the contribution of its cfs_rqs to task_groups' load may remain and will negatively impact the calculation of the share of the online CPUs. To fix this bug, clear the contribution of an offlining CPU to task groups' load and skip its contribution while it is inactive. Here's the reproducer of the anomaly, by Imran Khan: "So far I have encountered only one rather lengthy way of reproducing this issue, which is as follows: 1. Take a KVM guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/ cpu/test_group_2 2. Assign a CPU intensive workload to each of these cgroups and start the workload. For my tests I am using following app: int main(int argc, char *argv[]) { unsigned long count, i, val; if (argc != 2) { printf("usage: ./a.out <number of random nums to generate> \n"); return 0; } count = strtoul(argv[1], NULL, 10); printf("Generating %lu random numbers \n", count); for (i = 0; i < count; i++) { val = rand(); val = val % 2; //usleep(1); } printf("Generated %lu random numbers \n", count); return 0; } Also since the system is booted with 4 CPUs, in order to completely load the system I am also launching 4 instances of same test app under: /sys/fs/cgroup/cpu/ 3. 
We can see that both of the cgroups get similar CPU time: # systemd-cgtop --depth 1 Path Tasks %CPU Memory Input/s Output/s / 659 - 5.5G - - /system.slice - - 5.7G - - /test_group_1 4 - - - - /test_group_2 3 - - - - /user.slice 31 - 56.5M - - Path Tasks %CPU Memory Input/s Output/s / 659 394.6 5.5G - - /test_group_2 3 65.7 - - - /user.slice 29 55.1 48.0M - - /test_group_1 4 47.3 - - - /system.slice - 2.2 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 394.8 5.5G - - /test_group_1 4 62.9 - - - /user.slice 28 44.9 54.2M - - /test_group_2 3 44.7 - - - /system.slice - 0.9 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 394.4 5.5G - - /test_group_2 3 58.8 - - - /test_group_1 4 51.9 - - - /user.slice 30 39.3 59.6M - - /system.slice - 1.9 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 394.7 5.5G - - /test_group_1 4 60.9 - - - /test_group_2 3 57.9 - - - /user.slice 28 43.5 36.9M - - /system.slice - 3.0 5.7G - - Path Tasks %CPU Memory Input/s Output/s / 659 395.0 5.5G - - /test_group_1 4 66.8 - - - /test_group_2 3 56.3 - - - /user.slice 29 43.1 51.8M - - /system.slice - 0.7 5.7G - - 4. Now move systemd-udevd to one of these test groups, say test_group_1, and perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the host side. 5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu and one instance of CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/test_group_2. 
It can be seen that test_group_1 (the one where systemd-udevd was moved) is getting much less CPU time than the test_group_2, even though at this point of time both of these groups have only CPU hogger running: # systemd-cgtop --depth 1 Path Tasks %CPU Memory Input/s Output/s / 1219 - 5.4G - - /system.slice - - 5.6G - - /test_group_1 4 - - - - /test_group_2 3 - - - - /user.slice 26 - 91.3M - - Path Tasks %CPU Memory Input/s Output/s / 1221 394.3 5.4G - - /test_group_2 3 82.7 - - - /test_group_1 4 14.3 - - - /system.slice - 0.8 5.6G - - /user.slice 26 0.4 91.2M - - Path Tasks %CPU Memory Input/s Output/s / 1221 394.6 5.4G - - /test_group_2 3 67.4 - - - /system.slice - 24.6 5.6G - - /test_group_1 4 12.5 - - - /user.slice 26 0.4 91.2M - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.2 5.4G - - /test_group_2 3 60.9 - - - /system.slice - 27.9 5.6G - - /test_group_1 4 12.2 - - - /user.slice 26 0.4 91.2M - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.2 5.4G - - /test_group_2 3 69.4 - - - /test_group_1 4 13.9 - - - /user.slice 28 1.6 92.0M - - /system.slice - 1.0 5.6G - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.6 5.4G - - /test_group_2 3 59.3 - - - /test_group_1 4 14.1 - - - /user.slice 28 1.3 92.2M - - /system.slice - 0.7 5.6G - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.5 5.4G - - /test_group_2 3 67.2 - - - /test_group_1 4 11.5 - - - /user.slice 28 1.3 92.5M - - /system.slice - 0.6 5.6G - - Path Tasks %CPU Memory Input/s Output/s / 1221 395.1 5.4G - - /test_group_2 3 76.8 - - - /test_group_1 4 12.9 - - - /user.slice 28 1.3 92.8M - - /system.slice - 1.2 5.6G - - From sched_debug data it can be seen that in bad case the load.weight of per-CPU sched entities corresponding to test_group_1 has reduced significantly and also load_avg of test_group_1 remains much higher than that of test_group_2, even though systemd-udevd stopped running long time back and at this point of time both cgroups just have the CPU hogger app as running entity." 
[ mingo: Added details from the original discussion, plus minor edits to the patch. ]

Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/r/20231223111545.62135-1-vincent.guittot@linaro.org
1 parent 5254c0c commit f60a631

1 file changed (+52, -0 lines)

kernel/sched/fair.c

Lines changed: 52 additions & 0 deletions
@@ -4096,6 +4096,10 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	if (cfs_rq->tg == &root_task_group)
 		return;
 
+	/* rq has been offline and doesn't contribute to the share anymore: */
+	if (!cpu_active(cpu_of(rq_of(cfs_rq))))
+		return;
+
 	/*
 	 * For migration heavy workloads, access to tg->load_avg can be
 	 * unbound. Limit the update rate to at most once per ms.
@@ -4112,6 +4116,49 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 	}
 }
 
+static inline void clear_tg_load_avg(struct cfs_rq *cfs_rq)
+{
+	long delta;
+	u64 now;
+
+	/*
+	 * No need to update load_avg for root_task_group, as it is not used.
+	 */
+	if (cfs_rq->tg == &root_task_group)
+		return;
+
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	delta = 0 - cfs_rq->tg_load_avg_contrib;
+	atomic_long_add(delta, &cfs_rq->tg->load_avg);
+	cfs_rq->tg_load_avg_contrib = 0;
+	cfs_rq->last_update_tg_load_avg = now;
+}
+
+/* CPU offline callback: */
+static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
+{
+	struct task_group *tg;
+
+	lockdep_assert_rq_held(rq);
+
+	/*
+	 * The rq clock has already been updated in
+	 * set_rq_offline(), so we should skip updating
+	 * the rq clock again in unthrottle_cfs_rq().
+	 */
+	rq_clock_start_loop_update(rq);
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(tg, &task_groups, list) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+
+		clear_tg_load_avg(cfs_rq);
+	}
+	rcu_read_unlock();
+
+	rq_clock_stop_loop_update(rq);
+}
+
 /*
  * Called within set_task_rq() right before setting a task's CPU. The
  * caller only guarantees p->pi_lock is held; no other assumptions,
@@ -4408,6 +4455,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
 
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
 
+static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}
+
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
 	return 0;
@@ -12413,6 +12462,9 @@ static void rq_offline_fair(struct rq *rq)
 
 	/* Ensure any throttled groups are reachable by pick_next_task */
 	unthrottle_offline_cfs_rqs(rq);
+
+	/* Ensure that we remove rq contribution to group share: */
+	clear_tg_offline_cfs_rqs(rq);
 }
 
 #endif /* CONFIG_SMP */
