Commit 62de6e1
Merge tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 "Fair scheduler (SCHED_FAIR) enhancements:

   - Behavioral improvements:
      - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)

   - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
      - Rename h_nr_running into h_nr_queued
      - Add new cfs_rq.h_nr_runnable
      - Use the new cfs_rq.h_nr_runnable
      - Remove unused cfs_rq.h_nr_delayed
      - Rename cfs_rq.idle_h_nr_running into h_nr_idle
      - Remove unused cfs_rq.idle_nr_running
      - Rename cfs_rq.nr_running into nr_queued
      - Do not try to migrate delayed dequeue task
      - Fix variable declaration position
      - Encapsulate set custom slice in a __setparam_fair() function

   - Fixes:
      - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
      - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)

   - Cleanups:
      - Clean up in migrate_degrades_locality() to improve readability (Peter Zijlstra)
      - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
      - Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
      - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used() (Valentin Schneider)

  Deadline scheduler (SCHED_DL) enhancements:
   - Restore dl_server bandwidth on non-destructive root domain changes (Juri Lelli)
   - Correctly account for allocated bandwidth during hotplug (Juri Lelli)
   - Check bandwidth overflow earlier for hotplug (Juri Lelli)
   - Clean up goto label in pick_earliest_pushable_dl_task() (John Stultz)
   - Consolidate timer cancellation (Wander Lairson Costa)

  Load-balancer enhancements:
   - Improve performance by prioritizing migrating eligible tasks in sched_balance_rq() (Hao Jia)
   - Do not compute NUMA Balancing stats unnecessarily during load-balancing (K Prateek Nayak)
   - Do not compute overloaded status unnecessarily during load-balancing (K Prateek Nayak)

  Generic scheduling code enhancements:
   - Use READ_ONCE() in task_on_rq_queued(), to consistently use the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)

  Isolated CPUs support enhancements: (Waiman Long)
   - Make "isolcpus=nohz" equivalent to "nohz_full"
   - Consolidate housekeeping cpumasks that are always identical
   - Remove HK_TYPE_SCHED
   - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE

  RSEQ enhancements:
   - Validate read-only fields under DEBUG_RSEQ config (Mathieu Desnoyers)

  PSI enhancements:
   - Fix race when task wakes up before psi_sched_switch() adjusts flags (Chengming Zhou)

  IRQ time accounting performance enhancements: (Yafang Shao)
   - Define sched_clock_irqtime as static key
   - Don't account irq time if sched_clock_irqtime is disabled

  Virtual machine scheduling enhancements:
   - Don't try to catch up excess steal time (Suleiman Souhlal)

  Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)
   - Convert "sysctl_sched_itmt_enabled" to boolean
   - Use guard() for itmt_update_mutex
   - Move the "sched_itmt_enabled" sysctl to debugfs
   - Remove x86_smt_flags and use cpu_smt_flags directly
   - Use x86_sched_itmt_flags for PKG domain unconditionally

  Debugging code & instrumentation enhancements:
   - Change need_resched warnings to pr_err() (David Rientjes)
   - Print domain name in /proc/schedstat (K Prateek Nayak)
   - Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
   - Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
   - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
   - Update Schedstat version to 17 (Swapnil Sapkal)"

* tag 'sched-core-2025-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  rseq: Fix rseq unregistration regression
  psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
  sched, psi: Don't account irq time if sched_clock_irqtime is disabled
  sched: Don't account irq time if sched_clock_irqtime is disabled
  sched: Define sched_clock_irqtime as static key
  sched/fair: Do not compute overloaded status unnecessarily during lb
  sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
  x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
  x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
  x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
  x86/itmt: Use guard() for itmt_update_mutex
  x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
  sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
  sched/debug: Change need_resched warnings to pr_err
  sched/fair: Encapsulate set custom slice in a __setparam_fair() function
  sched: Fix race between yield_to() and try_to_wake_up()
  docs: Update Schedstat version to 17
  sched/stats: Print domain name in /proc/schedstat
  sched: Move sched domain name out of CONFIG_SCHED_DEBUG
  sched: Report the different kinds of imbalances in /proc/schedstat
  ...
2 parents 858df1d + 40724ec commit 62de6e1

23 files changed: +720 −478 lines changed

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 3 additions & 1 deletion
@@ -2506,7 +2506,9 @@
 			specified in the flag list (default: domain):
 
 			nohz
-			  Disable the tick when a single task runs.
+			  Disable the tick when a single task runs as well as
+			  disabling other kernel noises like having RCU callbacks
+			  offloaded. This is equivalent to the nohz_full parameter.
 
 			A residual 1Hz tick is offloaded to workqueues, which you
 			need to affine to housekeeping through the global
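Per the change above, the nohz flag of isolcpus now also covers the kernel-noise offloading that previously required nohz_full. A sketch of two boot command lines that should now behave equivalently (the CPU range 2-7 is purely illustrative):

```
# before this series: combine domain isolation with full dynticks
isolcpus=domain,2-7 nohz_full=2-7

# after sched-core-2025-01-21: the nohz flag alone is equivalent
isolcpus=nohz,domain,2-7
```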

Documentation/scheduler/sched-stats.rst

Lines changed: 75 additions & 51 deletions
@@ -2,14 +2,22 @@
 Scheduler Statistics
 ====================
 
+Version 17 of schedstats removed 'lb_imbalance' field as it has no
+significance anymore and instead added more relevant fields namely
+'lb_imbalance_load', 'lb_imbalance_util', 'lb_imbalance_task' and
+'lb_imbalance_misfit'. The domain field prints the name of the
+corresponding sched domain from this version onwards.
+
 Version 16 of schedstats changed the order of definitions within
 'enum cpu_idle_type', which changed the order of [CPU_MAX_IDLE_TYPES]
 columns in show_schedstat(). In particular the position of CPU_IDLE
 and __CPU_NOT_IDLE changed places. The size of the array is unchanged.
 
 Version 15 of schedstats dropped counters for some sched_yield:
 yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
-identical to version 14.
+identical to version 14. Details are available at
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-stats.txt?id=1e1dbb259c79b
 
 Version 14 of schedstats includes support for sched_domains, which hit the
 mainline kernel in 2.6.20 although it is identical to the stats from version
@@ -26,7 +34,14 @@ cpus on the machine, while domain0 is the most tightly focused domain,
 sometimes balancing only between pairs of cpus. At this time, there
 are no architectures which need more than three domain levels. The first
 field in the domain stats is a bit map indicating which cpus are affected
-by that domain.
+by that domain. Details are available at
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=b762f3ffb797c
+
+The schedstat documentation is maintained version 10 onwards and is not
+updated for version 11 and 12. The details for version 10 are available at
+
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/sched-stats.txt?id=1da177e4c3f4
 
 These fields are counters, and only increment. Programs which make use
 of these will need to start with a baseline observation and then calculate
@@ -71,88 +86,97 @@ Domain statistics
 -----------------
 One of these is produced per domain for each cpu described. (Note that if
 CONFIG_SMP is not defined, *no* domains are utilized and these lines
-will not appear in the output.)
+will not appear in the output. <name> is an extension to the domain field
+that prints the name of the corresponding sched domain. It can appear in
+schedstat version 17 and above, and requires CONFIG_SCHED_DEBUG.)
 
-domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
+domain<N> <name> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
 
 The first field is a bit mask indicating what cpus this domain operates over.
 
-The next 24 are a variety of sched_balance_rq() statistics in grouped into types
-of idleness (idle, busy, and newly idle):
+The next 33 are a variety of sched_balance_rq() statistics in grouped into types
+of idleness (busy, idle and newly idle):
 
     1) # of times in this domain sched_balance_rq() was called when the
+       cpu was busy
+    2) # of times in this domain sched_balance_rq() checked but found the
+       load did not require balancing when busy
+    3) # of times in this domain sched_balance_rq() tried to move one or
+       more tasks and failed, when the cpu was busy
+    4) Total imbalance in load when the cpu was busy
+    5) Total imbalance in utilization when the cpu was busy
+    6) Total imbalance in number of tasks when the cpu was busy
+    7) Total imbalance due to misfit tasks when the cpu was busy
+    8) # of times in this domain pull_task() was called when busy
+    9) # of times in this domain pull_task() was called even though the
+       target task was cache-hot when busy
+   10) # of times in this domain sched_balance_rq() was called but did not
+       find a busier queue while the cpu was busy
+   11) # of times in this domain a busier queue was found while the cpu
+       was busy but no busier group was found
+
+   12) # of times in this domain sched_balance_rq() was called when the
        cpu was idle
-    2) # of times in this domain sched_balance_rq() checked but found
+   13) # of times in this domain sched_balance_rq() checked but found
       the load did not require balancing when the cpu was idle
-    3) # of times in this domain sched_balance_rq() tried to move one or
+   14) # of times in this domain sched_balance_rq() tried to move one or
       more tasks and failed, when the cpu was idle
-    4) sum of imbalances discovered (if any) with each call to
-       sched_balance_rq() in this domain when the cpu was idle
-    5) # of times in this domain pull_task() was called when the cpu
+   15) Total imbalance in load when the cpu was idle
+   16) Total imbalance in utilization when the cpu was idle
+   17) Total imbalance in number of tasks when the cpu was idle
+   18) Total imbalance due to misfit tasks when the cpu was idle
+   19) # of times in this domain pull_task() was called when the cpu
       was idle
-    6) # of times in this domain pull_task() was called even though
+   20) # of times in this domain pull_task() was called even though
       the target task was cache-hot when idle
-    7) # of times in this domain sched_balance_rq() was called but did
+   21) # of times in this domain sched_balance_rq() was called but did
       not find a busier queue while the cpu was idle
-    8) # of times in this domain a busier queue was found while the
+   22) # of times in this domain a busier queue was found while the
       cpu was idle but no busier group was found
-    9) # of times in this domain sched_balance_rq() was called when the
-       cpu was busy
-   10) # of times in this domain sched_balance_rq() checked but found the
-       load did not require balancing when busy
-   11) # of times in this domain sched_balance_rq() tried to move one or
-       more tasks and failed, when the cpu was busy
-   12) sum of imbalances discovered (if any) with each call to
-       sched_balance_rq() in this domain when the cpu was busy
-   13) # of times in this domain pull_task() was called when busy
-   14) # of times in this domain pull_task() was called even though the
-       target task was cache-hot when busy
-   15) # of times in this domain sched_balance_rq() was called but did not
-       find a busier queue while the cpu was busy
-   16) # of times in this domain a busier queue was found while the cpu
-       was busy but no busier group was found
 
-   17) # of times in this domain sched_balance_rq() was called when the
-       cpu was just becoming idle
-   18) # of times in this domain sched_balance_rq() checked but found the
+   23) # of times in this domain sched_balance_rq() was called when the
+       was just becoming idle
+   24) # of times in this domain sched_balance_rq() checked but found the
       load did not require balancing when the cpu was just becoming idle
-   19) # of times in this domain sched_balance_rq() tried to move one or more
+   25) # of times in this domain sched_balance_rq() tried to move one or more
       tasks and failed, when the cpu was just becoming idle
-   20) sum of imbalances discovered (if any) with each call to
-       sched_balance_rq() in this domain when the cpu was just becoming idle
-   21) # of times in this domain pull_task() was called when newly idle
-   22) # of times in this domain pull_task() was called even though the
+   26) Total imbalance in load when the cpu was just becoming idle
+   27) Total imbalance in utilization when the cpu was just becoming idle
+   28) Total imbalance in number of tasks when the cpu was just becoming idle
+   29) Total imbalance due to misfit tasks when the cpu was just becoming idle
+   30) # of times in this domain pull_task() was called when newly idle
+   31) # of times in this domain pull_task() was called even though the
       target task was cache-hot when just becoming idle
-   23) # of times in this domain sched_balance_rq() was called but did not
+   32) # of times in this domain sched_balance_rq() was called but did not
      find a busier queue while the cpu was just becoming idle
-   24) # of times in this domain a busier queue was found while the cpu
+   33) # of times in this domain a busier queue was found while the cpu
      was just becoming idle but no busier group was found
 
 Next three are active_load_balance() statistics:
 
-   25) # of times active_load_balance() was called
-   26) # of times active_load_balance() tried to move a task and failed
-   27) # of times active_load_balance() successfully moved a task
+   34) # of times active_load_balance() was called
+   35) # of times active_load_balance() tried to move a task and failed
+   36) # of times active_load_balance() successfully moved a task
 
 Next three are sched_balance_exec() statistics:
 
-   28) sbe_cnt is not used
-   29) sbe_balanced is not used
-   30) sbe_pushed is not used
+   37) sbe_cnt is not used
+   38) sbe_balanced is not used
+   39) sbe_pushed is not used
 
 Next three are sched_balance_fork() statistics:
 
-   31) sbf_cnt is not used
-   32) sbf_balanced is not used
-   33) sbf_pushed is not used
+   40) sbf_cnt is not used
+   41) sbf_balanced is not used
+   42) sbf_pushed is not used
 
 Next three are try_to_wake_up() statistics:
 
-   34) # of times in this domain try_to_wake_up() awoke a task that
+   43) # of times in this domain try_to_wake_up() awoke a task that
       last ran on a different cpu in this domain
-   35) # of times in this domain try_to_wake_up() moved a task to the
+   44) # of times in this domain try_to_wake_up() moved a task to the
       waking cpu because it was cache-cold on its own cpu anyway
-   36) # of times in this domain try_to_wake_up() started passive balancing
+   45) # of times in this domain try_to_wake_up() started passive balancing
 
 /proc/<pid>/schedstat
 ---------------------

arch/x86/include/asm/topology.h

Lines changed: 2 additions & 2 deletions
@@ -251,7 +251,7 @@ extern bool x86_topology_update;
 #include <asm/percpu.h>
 
 DECLARE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
-extern unsigned int __read_mostly sysctl_sched_itmt_enabled;
+extern bool __read_mostly sysctl_sched_itmt_enabled;
 
 /* Interface to set priority of a cpu */
 void sched_set_itmt_core_prio(int prio, int core_cpu);
@@ -264,7 +264,7 @@ void sched_clear_itmt_support(void);
 
 #else /* CONFIG_SCHED_MC_PRIO */
 
-#define sysctl_sched_itmt_enabled	0
+#define sysctl_sched_itmt_enabled	false
 static inline void sched_set_itmt_core_prio(int prio, int core_cpu)
 {
 }

arch/x86/kernel/itmt.c

Lines changed: 33 additions & 48 deletions
@@ -19,6 +19,7 @@
 #include <linux/sched.h>
 #include <linux/cpumask.h>
 #include <linux/cpuset.h>
+#include <linux/debugfs.h>
 #include <linux/mutex.h>
 #include <linux/sysctl.h>
 #include <linux/nodemask.h>
@@ -34,49 +35,38 @@ static bool __read_mostly sched_itmt_capable;
  * of higher turbo frequency for cpus supporting Intel Turbo Boost Max
  * Technology 3.0.
  *
- * It can be set via /proc/sys/kernel/sched_itmt_enabled
+ * It can be set via /sys/kernel/debug/x86/sched_itmt_enabled
  */
-unsigned int __read_mostly sysctl_sched_itmt_enabled;
+bool __read_mostly sysctl_sched_itmt_enabled;
 
-static int sched_itmt_update_handler(const struct ctl_table *table, int write,
-				     void *buffer, size_t *lenp, loff_t *ppos)
+static ssize_t sched_itmt_enabled_write(struct file *filp,
+					const char __user *ubuf,
+					size_t cnt, loff_t *ppos)
 {
-	unsigned int old_sysctl;
-	int ret;
+	ssize_t result;
+	bool orig;
 
-	mutex_lock(&itmt_update_mutex);
+	guard(mutex)(&itmt_update_mutex);
 
-	if (!sched_itmt_capable) {
-		mutex_unlock(&itmt_update_mutex);
-		return -EINVAL;
-	}
-
-	old_sysctl = sysctl_sched_itmt_enabled;
-	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	orig = sysctl_sched_itmt_enabled;
+	result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
 
-	if (!ret && write && old_sysctl != sysctl_sched_itmt_enabled) {
+	if (sysctl_sched_itmt_enabled != orig) {
 		x86_topology_update = true;
 		rebuild_sched_domains();
 	}
 
-	mutex_unlock(&itmt_update_mutex);
-
-	return ret;
+	return result;
 }
 
-static struct ctl_table itmt_kern_table[] = {
-	{
-		.procname	= "sched_itmt_enabled",
-		.data		= &sysctl_sched_itmt_enabled,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= sched_itmt_update_handler,
-		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
-	},
+static const struct file_operations dfs_sched_itmt_fops = {
+	.read =		debugfs_read_file_bool,
+	.write =	sched_itmt_enabled_write,
+	.open =		simple_open,
+	.llseek =	default_llseek,
 };
 
-static struct ctl_table_header *itmt_sysctl_header;
+static struct dentry *dfs_sched_itmt;
 
 /**
  * sched_set_itmt_support() - Indicate platform supports ITMT
@@ -97,16 +87,18 @@ static struct ctl_table_header *itmt_sysctl_header;
  */
 int sched_set_itmt_support(void)
 {
-	mutex_lock(&itmt_update_mutex);
+	guard(mutex)(&itmt_update_mutex);
 
-	if (sched_itmt_capable) {
-		mutex_unlock(&itmt_update_mutex);
+	if (sched_itmt_capable)
 		return 0;
-	}
 
-	itmt_sysctl_header = register_sysctl("kernel", itmt_kern_table);
-	if (!itmt_sysctl_header) {
-		mutex_unlock(&itmt_update_mutex);
+	dfs_sched_itmt = debugfs_create_file_unsafe("sched_itmt_enabled",
+						    0644,
						    arch_debugfs_dir,
+						    &sysctl_sched_itmt_enabled,
+						    &dfs_sched_itmt_fops);
+	if (IS_ERR_OR_NULL(dfs_sched_itmt)) {
+		dfs_sched_itmt = NULL;
 		return -ENOMEM;
 	}
 
@@ -117,8 +109,6 @@ int sched_set_itmt_support(void)
 	x86_topology_update = true;
 	rebuild_sched_domains();
 
-	mutex_unlock(&itmt_update_mutex);
-
 	return 0;
 }
 
@@ -134,27 +124,22 @@ int sched_set_itmt_support(void)
  */
 void sched_clear_itmt_support(void)
 {
-	mutex_lock(&itmt_update_mutex);
+	guard(mutex)(&itmt_update_mutex);
 
-	if (!sched_itmt_capable) {
-		mutex_unlock(&itmt_update_mutex);
+	if (!sched_itmt_capable)
 		return;
-	}
+
 	sched_itmt_capable = false;
 
-	if (itmt_sysctl_header) {
-		unregister_sysctl_table(itmt_sysctl_header);
-		itmt_sysctl_header = NULL;
-	}
+	debugfs_remove(dfs_sched_itmt);
+	dfs_sched_itmt = NULL;
 
 	if (sysctl_sched_itmt_enabled) {
 		/* disable sched_itmt if we are no longer ITMT capable */
 		sysctl_sched_itmt_enabled = 0;
 		x86_topology_update = true;
 		rebuild_sched_domains();
 	}
-
-	mutex_unlock(&itmt_update_mutex);
 }
 
 int arch_asym_cpu_priority(int cpu)
