Commit bfe8eb3
Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 "Energy scheduling:

   - Consolidate how the max compute capacity is used in the scheduler
     and how we calculate the frequency for a level of utilization.

   - Rework interface between the scheduler and the schedutil governor

   - Simplify the util_est logic

  Deadline scheduler:

   - Work more towards reducing SCHED_DEADLINE starvation of low priority
     tasks (e.g., SCHED_OTHER) tasks when higher priority tasks monopolize
     CPU cycles, via the introduction of 'deadline servers'
     (nested/2-level scheduling). "Fair servers" to make use of this
     facility are not introduced yet.

  EEVDF:

   - Introduce O(1) fastpath for EEVDF task selection

  NUMA balancing:

   - Tune the NUMA-balancing vma scanning logic some more, to better
     distribute the probability of a particular vma getting scanned.

  Plus misc fixes, cleanups and updates"

* tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
  sched/fair: Fix tg->load when offlining a CPU
  sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
  sched/fair: Use all little CPUs for CPU-bound workloads
  sched/fair: Simplify util_est
  sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
  arm64/amu: Use capacity_ref_freq() to set AMU ratio
  cpufreq/cppc: Set the frequency used for computing the capacity
  cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
  energy_model: Use a fixed reference frequency
  cpufreq/schedutil: Use a fixed reference frequency
  cpufreq: Use the fixed and coherent frequency for scaling capacity
  sched/topology: Add a new arch_scale_freq_ref() method
  freezer,sched: Clean saved_state when restoring it during thaw
  sched/fair: Update min_vruntime for reweight_entity() correctly
  sched/doc: Update documentation after renames and synchronize Chinese version
  sched/cpufreq: Rework iowait boost
  sched/cpufreq: Rework schedutil governor performance estimation
  sched/pelt: Avoid underestimation of task utilization
  sched/timers: Explain why idle task schedules out on remote timer enqueue
  sched/cpuidle: Comment about timers requirements VS idle handler
  ...
2 parents aac4de4 + cdb3033 commit bfe8eb3

32 files changed: +1055 additions, -801 deletions
Documentation/scheduler/sched-design-CFS.rst

Lines changed: 4 additions & 4 deletions
@@ -180,7 +180,7 @@ This is the (partial) list of the hooks:
    compat_yield sysctl is turned on; in that case, it places the scheduling
    entity at the right-most end of the red-black tree.
 
-   - check_preempt_curr(...)
+   - wakeup_preempt(...)
 
    This function checks if a task that entered the runnable state should
    preempt the currently running task.
@@ -189,10 +189,10 @@ This is the (partial) list of the hooks:
 
    This function chooses the most appropriate task eligible to run next.
 
-   - set_curr_task(...)
+   - set_next_task(...)
 
-   This function is called when a task changes its scheduling class or changes
-   its task group.
+   This function is called when a task changes its scheduling class, changes
+   its task group or is scheduled.
 
    - task_tick(...)

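The renamed hooks above live in the kernel's table of scheduling-class callbacks. As a purely illustrative sketch (a made-up miniature, not the kernel's actual struct sched_class), the hook table is a struct of function pointers that each scheduling class fills in:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a scheduling-class hook table, using the
 * post-rename hook names (wakeup_preempt, set_next_task). Types and
 * semantics are simplified for illustration only. */
struct task {
	int prio;	/* lower value = higher priority, as in the kernel */
};

struct mini_sched_class {
	/* wakeup_preempt(): should the newly runnable task preempt curr? */
	int (*wakeup_preempt)(const struct task *curr, const struct task *waking);
	/* set_next_task(): task changed class/group, or is being scheduled */
	void (*set_next_task)(struct task *next);
};

static int last_set_prio = -1;

static int fair_wakeup_preempt(const struct task *curr, const struct task *waking)
{
	return waking->prio < curr->prio;
}

static void fair_set_next_task(struct task *next)
{
	last_set_prio = next->prio;	/* record bookkeeping side effect */
}

static const struct mini_sched_class mini_fair_class = {
	.wakeup_preempt = fair_wakeup_preempt,
	.set_next_task	= fair_set_next_task,
};
```

The core would then dispatch through the pointer table rather than calling a class's functions directly, which is what makes the per-class hook renames above purely internal.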
Documentation/scheduler/schedutil.rst

Lines changed: 3 additions & 4 deletions
@@ -90,17 +90,16 @@ For more detail see:
 - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
 
 
-UTIL_EST / UTIL_EST_FASTUP
-==========================
+UTIL_EST
+========
 
 Because periodic tasks have their averages decayed while they sleep, even
 though when running their expected utilization will be the same, they suffer a
 (DVFS) ramp-up after they are running again.
 
 To alleviate this (a default enabled option) UTIL_EST drives an Infinite
 Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
-highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
-filter to instantly increase and only decay on decrease.
+highest. UTIL_EST filters to instantly increase and only decay on decrease.
 
 A further runqueue wide sum (of runnable tasks) is maintained of:

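The UTIL_EST behaviour the diff above describes (instant increase, IIR EWMA decay on decrease) can be sketched as a small integer filter. The weight shift below matches the kernel's default of weighting the new sample 1/4, but the helper name and standalone form are illustrative, not the kernel's code:

```c
#include <assert.h>

/* New sample gets weight 1/4 in the EWMA, like the kernel's default. */
#define UTIL_EST_WEIGHT_SHIFT	2

/* Sketch of the util_est filter: jump straight to a higher utilization
 * sample, but only decay gradually toward a lower one. */
static unsigned int util_est_update(unsigned int ewma, unsigned int sample)
{
	if (sample >= ewma)
		return sample;	/* instant ramp-up */

	/* ewma += (sample - ewma) / 4, kept in unsigned arithmetic */
	return ewma - ((ewma - sample) >> UTIL_EST_WEIGHT_SHIFT);
}
```

For example, a task whose utilization sample jumps from 100 to 400 is estimated at 400 immediately, while a drop from 400 to 0 only moves the estimate to 300 on the first update, avoiding a DVFS ramp-up penalty when the periodic task runs again.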
Documentation/translations/zh_CN/scheduler/sched-design-CFS.rst

Lines changed: 4 additions & 4 deletions
@@ -80,7 +80,7 @@ p->se.vruntime。一旦p->se.vruntime变得足够大,其它的任务将成为
 CFS使用纳秒粒度的计时,不依赖于任何jiffies或HZ的细节。因此CFS并不像之前的调度器那样
 有“时间片”的概念,也没有任何启发式的设计。唯一可调的参数(你需要打开CONFIG_SCHED_DEBUG)是:
 
-   /sys/kernel/debug/sched/min_granularity_ns
+   /sys/kernel/debug/sched/base_slice_ns
 
 它可以用来将调度器从“桌面”模式(也就是低时延)调节为“服务器”(也就是高批处理)模式。
 它的默认设置是适合桌面的工作负载。SCHED_BATCH也被CFS调度器模块处理。
@@ -147,17 +147,17 @@ array)。
 这个函数的行为基本上是出队,紧接着入队,除非compat_yield sysctl被开启。在那种情况下,
 它将调度实体放在红黑树的最右端。
 
-   - check_preempt_curr(...)
+   - wakeup_preempt(...)
 
 这个函数检查进入可运行状态的任务能否抢占当前正在运行的任务。
 
    - pick_next_task(...)
 
 这个函数选择接下来最适合运行的任务。
 
-   - set_curr_task(...)
+   - set_next_task(...)
 
-   这个函数在任务改变调度类或改变任务组时被调用
+   这个函数在任务改变调度类,改变任务组时,或者任务被调度时被调用
 
    - task_tick(...)

Documentation/translations/zh_CN/scheduler/schedutil.rst

Lines changed: 3 additions & 4 deletions
@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最
 - Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
 
 
-UTIL_EST / UTIL_EST_FASTUP
-==========================
+UTIL_EST
+========
 
 由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
 因此它们在再次运行后会面临(DVFS)的上涨。
 
 为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应
 (Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。
-另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加,
-仅在利用率下降时衰减。
+UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。
 
 进一步,运行队列的(可运行任务的)利用率之和由下式计算:

arch/arm/include/asm/topology.h

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
 #define arch_set_freq_scale topology_set_freq_scale
 #define arch_scale_freq_capacity topology_get_freq_scale
 #define arch_scale_freq_invariant topology_scale_freq_invariant
+#define arch_scale_freq_ref topology_get_freq_ref
 #endif
 
 /* Replace task scheduler's default cpu-invariant accounting */

arch/arm64/include/asm/topology.h

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
 #define arch_set_freq_scale topology_set_freq_scale
 #define arch_scale_freq_capacity topology_get_freq_scale
 #define arch_scale_freq_invariant topology_scale_freq_invariant
+#define arch_scale_freq_ref topology_get_freq_ref
 
 #ifdef CONFIG_ACPI_CPPC_LIB
 #define arch_init_invariance_cppc topology_init_cpu_capacity_cppc

arch/arm64/kernel/topology.c

Lines changed: 13 additions & 13 deletions
@@ -82,7 +82,12 @@ int __init parse_acpi_topology(void)
 #undef pr_fmt
 #define pr_fmt(fmt) "AMU: " fmt
 
-static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale);
+/*
+ * Ensure that amu_scale_freq_tick() will return SCHED_CAPACITY_SCALE until
+ * the CPU capacity and its associated frequency have been correctly
+ * initialized.
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(unsigned long, arch_max_freq_scale) = 1UL << (2 * SCHED_CAPACITY_SHIFT);
 static DEFINE_PER_CPU(u64, arch_const_cycles_prev);
 static DEFINE_PER_CPU(u64, arch_core_cycles_prev);
 static cpumask_var_t amu_fie_cpus;
@@ -112,14 +117,14 @@ static inline bool freq_counters_valid(int cpu)
 	return true;
 }
 
-static int freq_inv_set_max_ratio(int cpu, u64 max_rate, u64 ref_rate)
+void freq_inv_set_max_ratio(int cpu, u64 max_rate)
 {
-	u64 ratio;
+	u64 ratio, ref_rate = arch_timer_get_rate();
 
 	if (unlikely(!max_rate || !ref_rate)) {
-		pr_debug("CPU%d: invalid maximum or reference frequency.\n",
+		WARN_ONCE(1, "CPU%d: invalid maximum or reference frequency.\n",
 			 cpu);
-		return -EINVAL;
+		return;
 	}
 
 	/*
@@ -139,12 +144,10 @@ static int freq_inv_set_max_ratio(int cpu, u64 max_rate, u64 ref_rate)
 	ratio = div64_u64(ratio, max_rate);
 	if (!ratio) {
 		WARN_ONCE(1, "Reference frequency too low.\n");
-		return -EINVAL;
+		return;
 	}
 
-	per_cpu(arch_max_freq_scale, cpu) = (unsigned long)ratio;
-
-	return 0;
+	WRITE_ONCE(per_cpu(arch_max_freq_scale, cpu), (unsigned long)ratio);
 }
 
 static void amu_scale_freq_tick(void)
@@ -195,10 +198,7 @@ static void amu_fie_setup(const struct cpumask *cpus)
 		return;
 
 	for_each_cpu(cpu, cpus) {
-		if (!freq_counters_valid(cpu) ||
-		    freq_inv_set_max_ratio(cpu,
-					   cpufreq_get_hw_max_freq(cpu) * 1000ULL,
-					   arch_timer_get_rate()))
+		if (!freq_counters_valid(cpu))
 			return;
 	}

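The arch_max_freq_scale value this diff touches is a ratio pre-scaled by 2 * SCHED_CAPACITY_SHIFT, so that at tick time a single right shift after multiplying by the counter deltas yields a SCHED_CAPACITY_SCALE-relative frequency scale. A simplified sketch of that arithmetic (helper names are made up; the kernel splits the shift into two steps to avoid 64-bit overflow, which this sketch skips):

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Pre-scaled ratio of the reference (constant-rate) counter frequency to
 * the CPU's maximum frequency. Illustrative only: the kernel shifts in
 * two halves to keep the intermediate within 64 bits. */
static uint64_t max_freq_ratio(uint64_t ref_rate, uint64_t max_rate)
{
	return (ref_rate << (2 * SCHED_CAPACITY_SHIFT)) / max_rate;
}

/* At each tick: scale = ratio * d(core cycles) / d(constant cycles),
 * shifted back down once and capped at SCHED_CAPACITY_SCALE. */
static uint64_t freq_scale_tick(uint64_t ratio, uint64_t core_delta,
				uint64_t const_delta)
{
	uint64_t scale = (ratio * core_delta / const_delta) >> SCHED_CAPACITY_SHIFT;

	return scale < SCHED_CAPACITY_SCALE ? scale : SCHED_CAPACITY_SCALE;
}
```

This also shows why the diff initializes the per-CPU ratio to 1UL << (2 * SCHED_CAPACITY_SHIFT): with that default, the tick computation degenerates to SCHED_CAPACITY_SCALE until the real maximum frequency is known.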
arch/riscv/include/asm/topology.h

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@
 #define arch_set_freq_scale topology_set_freq_scale
 #define arch_scale_freq_capacity topology_get_freq_scale
 #define arch_scale_freq_invariant topology_scale_freq_invariant
+#define arch_scale_freq_ref topology_get_freq_ref
 
 /* Replace task scheduler's default cpu-invariant accounting */
 #define arch_scale_cpu_capacity topology_get_cpu_scale

drivers/acpi/cppc_acpi.c

Lines changed: 104 additions & 0 deletions
@@ -39,6 +39,9 @@
 #include <linux/rwsem.h>
 #include <linux/wait.h>
 #include <linux/topology.h>
+#include <linux/dmi.h>
+#include <linux/units.h>
+#include <asm/unaligned.h>
 
 #include <acpi/cppc_acpi.h>
 
@@ -1760,3 +1763,104 @@ unsigned int cppc_get_transition_latency(int cpu_num)
 	return latency_ns;
 }
 EXPORT_SYMBOL_GPL(cppc_get_transition_latency);
+
+/* Minimum struct length needed for the DMI processor entry we want */
+#define DMI_ENTRY_PROCESSOR_MIN_LENGTH	48
+
+/* Offset in the DMI processor structure for the max frequency */
+#define DMI_PROCESSOR_MAX_SPEED		0x14
+
+/* Callback function used to retrieve the max frequency from DMI */
+static void cppc_find_dmi_mhz(const struct dmi_header *dm, void *private)
+{
+	const u8 *dmi_data = (const u8 *)dm;
+	u16 *mhz = (u16 *)private;
+
+	if (dm->type == DMI_ENTRY_PROCESSOR &&
+	    dm->length >= DMI_ENTRY_PROCESSOR_MIN_LENGTH) {
+		u16 val = (u16)get_unaligned((const u16 *)
+				(dmi_data + DMI_PROCESSOR_MAX_SPEED));
+		*mhz = val > *mhz ? val : *mhz;
+	}
+}
+
+/* Look up the max frequency in DMI */
+static u64 cppc_get_dmi_max_khz(void)
+{
+	u16 mhz = 0;
+
+	dmi_walk(cppc_find_dmi_mhz, &mhz);
+
+	/*
+	 * Real stupid fallback value, just in case there is no
+	 * actual value set.
+	 */
+	mhz = mhz ? mhz : 1;
+
+	return KHZ_PER_MHZ * mhz;
+}
+
+/*
+ * If CPPC lowest_freq and nominal_freq registers are exposed then we can
+ * use them to convert perf to freq and vice versa. The conversion is
+ * extrapolated as an affine function passing by the 2 points:
+ * - (Low perf, Low freq)
+ * - (Nominal perf, Nominal freq)
+ */
+unsigned int cppc_perf_to_khz(struct cppc_perf_caps *caps, unsigned int perf)
+{
+	s64 retval, offset = 0;
+	static u64 max_khz;
+	u64 mul, div;
+
+	if (caps->lowest_freq && caps->nominal_freq) {
+		mul = caps->nominal_freq - caps->lowest_freq;
+		mul *= KHZ_PER_MHZ;
+		div = caps->nominal_perf - caps->lowest_perf;
+		offset = caps->nominal_freq * KHZ_PER_MHZ -
+			 div64_u64(caps->nominal_perf * mul, div);
+	} else {
+		if (!max_khz)
+			max_khz = cppc_get_dmi_max_khz();
+		mul = max_khz;
+		div = caps->highest_perf;
+	}
+
+	retval = offset + div64_u64(perf * mul, div);
+	if (retval >= 0)
+		return retval;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cppc_perf_to_khz);
+
+unsigned int cppc_khz_to_perf(struct cppc_perf_caps *caps, unsigned int freq)
+{
+	s64 retval, offset = 0;
+	static u64 max_khz;
+	u64 mul, div;
+
+	if (caps->lowest_freq && caps->nominal_freq) {
+		mul = caps->nominal_perf - caps->lowest_perf;
+		div = caps->nominal_freq - caps->lowest_freq;
+		/*
+		 * We don't need to convert to kHz for computing offset and can
+		 * directly use nominal_freq and lowest_freq as the div64_u64
+		 * will remove the frequency unit.
+		 */
+		offset = caps->nominal_perf -
+			 div64_u64(caps->nominal_freq * mul, div);
+		/* But we need it for computing the perf level. */
+		div *= KHZ_PER_MHZ;
+	} else {
+		if (!max_khz)
+			max_khz = cppc_get_dmi_max_khz();
+		mul = caps->highest_perf;
+		div = max_khz;
+	}
+
+	retval = offset + div64_u64(freq * mul, div);
+	if (retval >= 0)
+		return retval;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cppc_khz_to_perf);

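The affine extrapolation used by the new cppc_perf_to_khz() above can be checked with plain integer arithmetic: the conversion is the line through (lowest_perf, lowest_freq) and (nominal_perf, nominal_freq). A userspace sketch of just that path (struct and helper names are simplified; the DMI fallback branch is omitted):

```c
#include <assert.h>
#include <stdint.h>

#define KHZ_PER_MHZ	1000ULL

/* Simplified capability set mirroring the CPPC fields used above.
 * Frequencies are in MHz, as in the CPPC registers. */
struct perf_caps {
	uint64_t lowest_perf, nominal_perf;
	uint64_t lowest_freq, nominal_freq;
};

/* Affine perf -> kHz conversion through the two known points:
 * khz = offset + perf * slope, with slope = d(freq) / d(perf). */
static uint64_t perf_to_khz(const struct perf_caps *c, uint64_t perf)
{
	uint64_t mul = (c->nominal_freq - c->lowest_freq) * KHZ_PER_MHZ;
	uint64_t div = c->nominal_perf - c->lowest_perf;
	int64_t offset = (int64_t)(c->nominal_freq * KHZ_PER_MHZ) -
			 (int64_t)(c->nominal_perf * mul / div);
	int64_t khz = offset + (int64_t)(perf * mul / div);

	return khz > 0 ? (uint64_t)khz : 0;	/* clamp like the kernel */
}
```

With lowest (perf 10, 500 MHz) and nominal (perf 100, 2000 MHz), a perf of 100 maps back to 2,000,000 kHz and a perf of 10 to 500,000 kHz, confirming the line passes through both calibration points.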