
Commit b167fdf

Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 "Load-balancing improvements:

   - Improve NUMA balancing on AMD Zen systems for affine workloads.

   - Improve the handling of reduced-capacity CPUs in load-balancing.

   - Energy Model improvements: fix & refine all the energy fairness
     metrics (PELT), and remove the conservative threshold requiring 6%
     energy savings to migrate a task. Doing this improves power
     efficiency for most workloads, and also increases the reliability
     of energy-efficiency scheduling.

   - Optimize/tweak select_idle_cpu() to spend (much) less time searching
     for an idle CPU on overloaded systems. There are reports of several
     milliseconds spent there on large systems with large workloads ...

     [ Since the search logic changed, there might be behavioral side
       effects. ]

   - Improve NUMA imbalance behavior. On certain systems with spare
     capacity, initial placement of tasks is non-deterministic, and such
     an artificial placement imbalance can persist for a long time,
     hurting (and sometimes helping) performance.

     The fix is to make fork-time task placement consistent with runtime
     NUMA balancing placement.

     Note that some performance regressions were reported against this,
     caused by workloads that are not memory bandwidth limited, which
     benefit from the artificial locality of the placement bug(s). Mel
     Gorman's conclusion, with which we concur, was that consistency is
     better than random workload benefits from non-deterministic bugs:

        "Given there is no crystal ball and it's a tradeoff, I think
         it's better to be consistent and use similar logic at both
         fork time and runtime even if it doesn't have universal
         benefit."

   - Improve core scheduling by fixing a bug in
     sched_core_update_cookie() that caused unnecessary forced idling.

   - Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
     for newly woken tasks.

   - Fix a newidle balancing bug that introduced unnecessary wakeup
     latencies.

  ABI improvements/fixes:

   - Do not check capabilities and do not issue capability check denial
     messages when a scheduler syscall doesn't require privileges. (Such
     as increasing niceness; see the sketch after the commit list below.)

   - Add forced-idle accounting to cgroups too.

   - Fix/improve the RSEQ ABI to not just silently accept unknown flags.
     (No existing tooling is known to have learned to rely on the
     previous behavior.)

   - Deprecate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.

  Optimizations:

   - Optimize & simplify leaf_cfs_rq_list().

   - Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().

  Misc fixes & cleanups:

   - Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.

   - Fix a full-NOHZ bug that can in some cases result in the tick not
     being re-enabled when the last SCHED_RT task is gone from a
     runqueue but there are still SCHED_OTHER tasks around.

   - Various PREEMPT_RT related fixes.

   - Misc cleanups & smaller fixes"

* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
  rseq: Kill process when unknown flags are encountered in ABI structures
  rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
  sched/core: Fix the bug that task won't enqueue into core tree when update cookie
  nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
  sched/core: Always flush pending blk_plug
  sched/fair: fix case with reduced capacity CPU
  sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
  sched/core: add forced idle accounting for cgroups
  sched/fair: Remove the energy margin in feec()
  sched/fair: Remove task_util from effective utilization in feec()
  sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
  sched/fair: Rename select_idle_mask to select_rq_mask
  sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
  sched/fair: Decay task PELT values during wakeup migration
  sched/fair: Provide u64 read for 32-bits arch helper
  sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
  sched: only perform capability check on privileged operation
  sched: Remove unused function group_first_cpu()
  sched/fair: Remove redundant word " *"
  selftests/rseq: check if libc rseq support is registered
  ...
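
As a concrete illustration of the "no capability check for unprivileged operations" item above: raising one's own nice value (lowering priority) never required CAP_SYS_NICE, and with this update the kernel no longer performs a capability check, nor emits an audit denial message, for such requests. A minimal userspace sketch with illustrative values (error handling kept minimal):

/*
 * Illustrative only: raising our own nice value affects only this task,
 * so no CAP_SYS_NICE is needed; the kernel now also skips the capability
 * check (and the denial message it could trigger) for this case.
 */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        int cur = getpriority(PRIO_PROCESS, 0);        /* current nice value */

        if (setpriority(PRIO_PROCESS, 0, cur + 5))
                perror("setpriority");
        else
                printf("nice raised from %d to %d\n", cur, cur + 5);
        return 0;
}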
2 parents 0dd1cab + c17a6ff commit b167fdf

File tree

22 files changed: +888 / -511 lines

drivers/powercap/dtpm_cpu.c

Lines changed: 9 additions & 24 deletions
@@ -71,34 +71,19 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
 
 static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
 {
-	unsigned long max = 0, sum_util = 0;
+	unsigned long max, sum_util = 0;
 	int cpu;
 
-	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-
-		/*
-		 * The capacity is the same for all CPUs belonging to
-		 * the same perf domain, so a single call to
-		 * arch_scale_cpu_capacity() is enough. However, we
-		 * need the CPU parameter to be initialized by the
-		 * loop, so the call ends up in this block.
-		 *
-		 * We can initialize 'max' with a cpumask_first() call
-		 * before the loop but the bits computation is not
-		 * worth given the arch_scale_cpu_capacity() just
-		 * returns a value where the resulting assembly code
-		 * will be optimized by the compiler.
-		 */
-		max = arch_scale_cpu_capacity(cpu);
-		sum_util += sched_cpu_util(cpu, max);
-	}
-
 	/*
-	 * In the improbable case where all the CPUs of the perf
-	 * domain are offline, 'max' will be zero and will lead to an
-	 * illegal operation with a zero division.
+	 * The capacity is the same for all CPUs belonging to
+	 * the same perf domain.
 	 */
-	return max ? (power * ((sum_util << 10) / max)) >> 10 : 0;
+	max = arch_scale_cpu_capacity(cpumask_first(pd_mask));
+
+	for_each_cpu_and(cpu, pd_mask, cpu_online_mask)
+		sum_util += sched_cpu_util(cpu);
+
+	return (power * ((sum_util << 10) / max)) >> 10;
 }
 
 static u64 get_pd_power_uw(struct dtpm *dtpm)
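
For reference, a standalone sketch of the fixed-point scaling kept by scale_pd_power_uw() above: power is scaled by the ratio sum_util/max using a 10-bit shift. Since 'max' is now assigned unconditionally from the first CPU of the perf domain (rather than only inside the online-CPU loop), it can no longer be left at zero, which is why the old division-by-zero guard was dropped. The numbers below are hypothetical:

#include <stdio.h>

/*
 * (power * ((sum_util << 10) / max)) >> 10  ~=  power * sum_util / max,
 * computed with 10-bit fixed point to stay in integer arithmetic.
 */
static unsigned long long scale_power(unsigned long long power,
                                      unsigned long sum_util,
                                      unsigned long max)
{
        return (power * ((sum_util << 10) / max)) >> 10;
}

int main(void)
{
        /* Hypothetical: 2,000,000 uW budget, domain at 768/1024 of capacity. */
        printf("%llu uW\n", scale_power(2000000ULL, 768, 1024));   /* -> 1500000 */
        return 0;
}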

drivers/thermal/cpufreq_cooling.c

Lines changed: 2 additions & 4 deletions
@@ -137,11 +137,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
 		    int cpu_idx)
 {
-	unsigned long max = arch_scale_cpu_capacity(cpu);
-	unsigned long util;
+	unsigned long util = sched_cpu_util(cpu);
 
-	util = sched_cpu_util(cpu, max);
-	return (util * 100) / max;
+	return (util * 100) / arch_scale_cpu_capacity(cpu);
 }
 #else /* !CONFIG_SMP */
 static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,

include/linux/cgroup-defs.h

Lines changed: 4 additions & 0 deletions
@@ -288,6 +288,10 @@ struct css_set {
 
 struct cgroup_base_stat {
 	struct task_cputime cputime;
+
+#ifdef CONFIG_SCHED_CORE
+	u64 forceidle_sum;
+#endif
 };
 
 /*

include/linux/kernel_stat.h

Lines changed: 7 additions & 0 deletions
@@ -28,6 +28,9 @@ enum cpu_usage_stat {
 	CPUTIME_STEAL,
 	CPUTIME_GUEST,
 	CPUTIME_GUEST_NICE,
+#ifdef CONFIG_SCHED_CORE
+	CPUTIME_FORCEIDLE,
+#endif
 	NR_STATS,
 };
 
@@ -115,4 +118,8 @@ extern void account_process_tick(struct task_struct *, int user);
 
 extern void account_idle_ticks(unsigned long ticks);
 
+#ifdef CONFIG_SCHED_CORE
+extern void __account_forceidle_time(struct task_struct *tsk, u64 delta);
+#endif
+
 #endif /* _LINUX_KERNEL_STAT_H */

include/linux/sched.h

Lines changed: 1 addition & 1 deletion
@@ -2257,7 +2257,7 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 }
 
 /* Returns effective CPU energy utilization, as seen by the scheduler */
-unsigned long sched_cpu_util(int cpu, unsigned long max);
+unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
 #ifdef CONFIG_RSEQ

include/linux/sched/rt.h

Lines changed: 0 additions & 8 deletions
@@ -39,20 +39,12 @@ static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *p)
 }
 extern void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);
 extern void rt_mutex_adjust_pi(struct task_struct *p);
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return tsk->pi_blocked_on != NULL;
-}
 #else
 static inline struct task_struct *rt_mutex_get_top_task(struct task_struct *task)
 {
 	return NULL;
 }
 # define rt_mutex_adjust_pi(p) do { } while (0)
-static inline bool tsk_is_pi_blocked(struct task_struct *tsk)
-{
-	return false;
-}
 #endif
 
 extern void normalize_rt_tasks(void);

include/linux/sched/topology.h

Lines changed: 1 addition & 0 deletions
@@ -81,6 +81,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		nr_idle_scan;
 };
 
 struct sched_domain {
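
The new nr_idle_scan field backs the SIS_UTIL change listed in the merge ("search idle CPU based on sum of util_avg"): a periodically updated, utilization-based budget that limits how many CPUs select_idle_cpu() scans. The sketch below is only an illustrative model of that idea, not the kernel's actual formula; the helper name, the 85% cut-off and all numbers are made up:

#include <stdio.h>

/*
 * Illustrative model only -- NOT the kernel's formula. The idea: the
 * busier an LLC domain already is (higher summed util_avg), the fewer
 * CPUs an idle search should bother scanning.
 */
static int nr_idle_scan_estimate(unsigned long sum_util,
                                 unsigned long domain_capacity,
                                 int nr_cpus)
{
        unsigned long busy_pct = (sum_util * 100) / domain_capacity;

        if (busy_pct >= 85)             /* nearly saturated: give up quickly */
                return 0;
        return nr_cpus - (int)(nr_cpus * busy_pct / 100);
}

int main(void)
{
        /* Hypothetical 16-CPU LLC, each CPU with capacity 1024. */
        printf("%d\n", nr_idle_scan_estimate(4096, 16 * 1024, 16));   /* lightly loaded */
        printf("%d\n", nr_idle_scan_estimate(15360, 16 * 1024, 16));  /* heavily loaded */
        return 0;
}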

kernel/cgroup/rstat.c

Lines changed: 38 additions & 6 deletions
@@ -310,6 +310,9 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
 	dst_bstat->cputime.utime += src_bstat->cputime.utime;
 	dst_bstat->cputime.stime += src_bstat->cputime.stime;
 	dst_bstat->cputime.sum_exec_runtime += src_bstat->cputime.sum_exec_runtime;
+#ifdef CONFIG_SCHED_CORE
+	dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
+#endif
 }
 
 static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
@@ -318,6 +321,9 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 	dst_bstat->cputime.utime -= src_bstat->cputime.utime;
 	dst_bstat->cputime.stime -= src_bstat->cputime.stime;
 	dst_bstat->cputime.sum_exec_runtime -= src_bstat->cputime.sum_exec_runtime;
+#ifdef CONFIG_SCHED_CORE
+	dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
+#endif
 }
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -398,6 +404,11 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
 	case CPUTIME_SOFTIRQ:
 		rstatc->bstat.cputime.stime += delta_exec;
 		break;
+#ifdef CONFIG_SCHED_CORE
+	case CPUTIME_FORCEIDLE:
+		rstatc->bstat.forceidle_sum += delta_exec;
+		break;
+#endif
 	default:
 		break;
 	}
@@ -411,8 +422,9 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
  * with how it is done by __cgroup_account_cputime_field for each bit of
  * cpu time attributed to a cgroup.
  */
-static void root_cgroup_cputime(struct task_cputime *cputime)
+static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 {
+	struct task_cputime *cputime = &bstat->cputime;
 	int i;
 
 	cputime->stime = 0;
@@ -438,34 +450,54 @@ static void root_cgroup_cputime(struct task_cputime *cputime)
 		cputime->sum_exec_runtime += user;
 		cputime->sum_exec_runtime += sys;
 		cputime->sum_exec_runtime += cpustat[CPUTIME_STEAL];
+
+#ifdef CONFIG_SCHED_CORE
+		bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
+#endif
 	}
 }
 
 void cgroup_base_stat_cputime_show(struct seq_file *seq)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
 	u64 usage, utime, stime;
-	struct task_cputime cputime;
+	struct cgroup_base_stat bstat;
+#ifdef CONFIG_SCHED_CORE
+	u64 forceidle_time;
+#endif
 
 	if (cgroup_parent(cgrp)) {
 		cgroup_rstat_flush_hold(cgrp);
 		usage = cgrp->bstat.cputime.sum_exec_runtime;
 		cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
 			       &utime, &stime);
+#ifdef CONFIG_SCHED_CORE
+		forceidle_time = cgrp->bstat.forceidle_sum;
+#endif
 		cgroup_rstat_flush_release();
 	} else {
-		root_cgroup_cputime(&cputime);
-		usage = cputime.sum_exec_runtime;
-		utime = cputime.utime;
-		stime = cputime.stime;
+		root_cgroup_cputime(&bstat);
+		usage = bstat.cputime.sum_exec_runtime;
+		utime = bstat.cputime.utime;
+		stime = bstat.cputime.stime;
+#ifdef CONFIG_SCHED_CORE
+		forceidle_time = bstat.forceidle_sum;
+#endif
 	}
 
 	do_div(usage, NSEC_PER_USEC);
 	do_div(utime, NSEC_PER_USEC);
 	do_div(stime, NSEC_PER_USEC);
+#ifdef CONFIG_SCHED_CORE
+	do_div(forceidle_time, NSEC_PER_USEC);
+#endif
 
 	seq_printf(seq, "usage_usec %llu\n"
 		   "user_usec %llu\n"
 		   "system_usec %llu\n",
 		   usage, utime, stime);
+
+#ifdef CONFIG_SCHED_CORE
+	seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
+#endif
 }
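
The user-visible result of the rstat plumbing above is one extra line in a cgroup's cpu.stat file when CONFIG_SCHED_CORE is enabled, printed by cgroup_base_stat_cputime_show(). A minimal userspace sketch that reads it back; the cgroup path is hypothetical and a cgroup v2 mount at /sys/fs/cgroup is assumed:

#include <stdio.h>

int main(void)
{
        /* Hypothetical cgroup path; adjust to a real cgroup v2 hierarchy. */
        FILE *f = fopen("/sys/fs/cgroup/mygroup/cpu.stat", "r");
        char line[256];
        unsigned long long usec;

        if (!f) {
                perror("fopen");
                return 1;
        }
        while (fgets(line, sizeof(line), f)) {
                /* Line emitted by cgroup_base_stat_cputime_show() above. */
                if (sscanf(line, "core_sched.force_idle_usec %llu", &usec) == 1)
                        printf("forced idle: %llu usec\n", usec);
        }
        fclose(f);
        return 0;
}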

kernel/rseq.c

Lines changed: 8 additions & 15 deletions
@@ -18,8 +18,9 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/rseq.h>
 
-#define RSEQ_CS_PREEMPT_MIGRATE_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | \
-				       RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)
+#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
+				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
+				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
 
 /*
  *
@@ -175,23 +176,15 @@ static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
 	u32 flags, event_mask;
 	int ret;
 
+	if (WARN_ON_ONCE(cs_flags & RSEQ_CS_NO_RESTART_FLAGS) || cs_flags)
+		return -EINVAL;
+
 	/* Get thread flags. */
 	ret = get_user(flags, &t->rseq->flags);
 	if (ret)
 		return ret;
 
-	/* Take critical section flags into account. */
-	flags |= cs_flags;
-
-	/*
-	 * Restart on signal can only be inhibited when restart on
-	 * preempt and restart on migrate are inhibited too. Otherwise,
-	 * a preempted signal handler could fail to restart the prior
-	 * execution context on sigreturn.
-	 */
-	if (unlikely((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
-		     (flags & RSEQ_CS_PREEMPT_MIGRATE_FLAGS) !=
-		     RSEQ_CS_PREEMPT_MIGRATE_FLAGS))
+	if (WARN_ON_ONCE(flags & RSEQ_CS_NO_RESTART_FLAGS) || flags)
 		return -EINVAL;
 
 	/*
@@ -203,7 +196,7 @@ static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
 	t->rseq_event_mask = 0;
 	preempt_enable();
 
-	return !!(event_mask & ~flags);
+	return !!event_mask;
 }
 
 static int clear_rseq_cs(struct task_struct *t)
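
The effect of the rseq_need_restart() change above: any non-zero flags value, whether in the registered struct rseq or in the current critical-section descriptor, is now rejected with -EINVAL (and, per the commit list, the offending process is killed), while the WARN covers only the known-but-deprecated RSEQ_CS_FLAG_NO_RESTART_ON_* bits. A standalone sketch of just that acceptance rule, re-typed here for illustration (flag values copied from the rseq UAPI header):

#include <stdio.h>

/* Flag values as defined in include/uapi/linux/rseq.h. */
#define RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT	(1U << 0)
#define RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL	(1U << 1)
#define RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	(1U << 2)

#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
                                  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
                                  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)

/*
 * Mirrors the new check: only a flags value of 0 is accepted; the
 * deprecated NO_RESTART_ON_* bits additionally get flagged (the kernel
 * uses WARN_ON_ONCE, here we just print a note).
 */
static int flags_accepted(unsigned int flags)
{
        if (flags & RSEQ_CS_NO_RESTART_FLAGS)
                fprintf(stderr, "deprecated RSEQ_CS_FLAG_NO_RESTART_ON_* bit set\n");
        return flags == 0;
}

int main(void)
{
        printf("0x0 -> %s\n", flags_accepted(0) ? "ok" : "EINVAL");
        printf("0x1 -> %s\n", flags_accepted(RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT) ? "ok" : "EINVAL");
        printf("0x8 -> %s\n", flags_accepted(0x8) ? "ok" : "EINVAL");   /* unknown bit */
        return 0;
}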
