Commit 7a9072d

Merge branch 'pm-cpuidle'
Merge cpuidle updates for 6.15-rc5, including a menu governor update
that is reported to improve some benchmark results quite significantly:

 - Update the handling of the most recent idle intervals in the menu
   cpuidle governor to prevent useful information from being discarded
   by it in some cases and improve the prediction accuracy (Rafael
   Wysocki).

 - Make it possible to tell the intel_idle driver to ignore its built-in
   table of idle states for the given processor, clean up the handling
   of auto-demotion disabling on Baytrail and Cherrytrail chips in it,
   and update its MAINTAINERS entry (David Arcari, Artem Bityutskiy,
   Rafael Wysocki).

 - Make some cpuidle drivers use for_each_present_cpu() instead of
   for_each_possible_cpu() during initialization to avoid issues
   occurring when nosmp or maxcpus=0 are used (Jacky Bai).

* pm-cpuidle:
  cpuidle: Init cpuidle only for present CPUs
  cpuidle: intel_idle: Update MAINTAINERS
  intel_idle: introduce 'no_native' module parameter
  cpuidle: menu: Update documentation after get_typical_interval() changes
  cpuidle: menu: Avoid discarding useful information
  cpuidle: menu: Eliminate outliers on both ends of the sample set
  cpuidle: menu: Tweak threshold use in get_typical_interval()
  cpuidle: menu: Use one loop for average and variance computations
  cpuidle: menu: Drop a redundant local variable
  intel_idle: clean up BYT/CHT auto demotion disable

2 parents 1774be7 + 68cb013 commit 7a9072d

10 files changed: +139 -100 lines changed

Documentation/admin-guide/pm/cpuidle.rst

Lines changed: 17 additions & 12 deletions
@@ -275,20 +275,25 @@ values and, when predicting the idle duration next time, it computes the average
 and variance of them. If the variance is small (smaller than 400 square
 milliseconds) or it is small relative to the average (the average is greater
 that 6 times the standard deviation), the average is regarded as the "typical
-interval" value. Otherwise, the longest of the saved observed idle duration
+interval" value. Otherwise, either the longest or the shortest (depending on
+which one is farther from the average) of the saved observed idle duration
 values is discarded and the computation is repeated for the remaining ones.
+
 Again, if the variance of them is small (in the above sense), the average is
 taken as the "typical interval" value and so on, until either the "typical
-interval" is determined or too many data points are disregarded, in which case
-the "typical interval" is assumed to equal "infinity" (the maximum unsigned
-integer value).
-
-If the "typical interval" computed this way is long enough, the governor obtains
-the time until the closest timer event with the assumption that the scheduler
-tick will be stopped. That time, referred to as the *sleep length* in what follows,
-is the upper bound on the time before the next CPU wakeup. It is used to determine
-the sleep length range, which in turn is needed to get the sleep length correction
-factor.
+interval" is determined or too many data points are disregarded. In the latter
+case, if the size of the set of data points still under consideration is
+sufficiently large, the next idle duration is not likely to be above the largest
+idle duration value still in that set, so that value is taken as the predicted
+next idle duration. Finally, if the set of data points still under
+consideration is too small, no prediction is made.
+
+If the preliminary prediction of the next idle duration computed this way is
+long enough, the governor obtains the time until the closest timer event with
+the assumption that the scheduler tick will be stopped. That time, referred to
+as the *sleep length* in what follows, is the upper bound on the time before the
+next CPU wakeup. It is used to determine the sleep length range, which in turn
+is needed to get the sleep length correction factor.

 The ``menu`` governor maintains an array containing several correction factor
 values that correspond to different sleep length ranges organized so that each
@@ -302,7 +307,7 @@ to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
 The sleep length is multiplied by the correction factor for the range that it
 falls into to obtain an approximation of the predicted idle duration that is
 compared to the "typical interval" determined previously and the minimum of
-the two is taken as the idle duration prediction.
+the two is taken as the final idle duration prediction.

 If the "typical interval" value is small, which means that the CPU is likely
 to be woken up soon enough, the sleep length computation is skipped as it may
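
The last step described in the documentation hunk above (scale the sleep length by the correction factor, then take the minimum of that and the "typical interval") can be sketched as follows. This is only an illustration, not the governor's code: the function name, the FACTOR_ONE fixed-point scale and the microsecond units are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical fixed-point scale: a factor equal to FACTOR_ONE means 1.0. */
#define FACTOR_ONE 1024u

/*
 * Sketch of the final prediction step: multiply the sleep length by a
 * correction factor in the range [0, FACTOR_ONE] and return the smaller of
 * the result and the "typical interval".
 */
static uint64_t final_prediction_sketch(uint64_t sleep_length_us,
					uint32_t factor,
					uint64_t typical_interval_us)
{
	uint64_t corrected = sleep_length_us * factor / FACTOR_ONE;

	return corrected < typical_interval_us ? corrected : typical_interval_us;
}
```

With a sleep length of 1000 us and a factor of one half (512), the corrected value of 500 us wins over a typical interval of 800 us; with a factor of 1.0 the typical interval of 800 us is the smaller value and is returned instead.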

Documentation/admin-guide/pm/intel_idle.rst

Lines changed: 13 additions & 5 deletions
@@ -192,11 +192,19 @@ even if they have been enumerated (see :ref:`cpu-pm-qos` in
 Documentation/admin-guide/pm/cpuidle.rst).
 Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.

-The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
-if the kernel has been configured with ACPI support) can be set to make the
-driver ignore the system's ACPI tables entirely or use them for all of the
-recognized processor models, respectively (they both are unset by default and
-``use_acpi`` has no effect if ``no_acpi`` is set).
+The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are
+recognized by ``intel_idle`` if the kernel has been configured with ACPI
+support. In the case that ACPI is not configured these flags have no impact
+on functionality.
+
+``no_acpi`` - Do not use ACPI at all. Only native mode is available, no
+ACPI mode.
+
+``use_acpi`` - No-op in ACPI mode, the driver will consult ACPI tables for
+C-states on/off status in native mode.
+
+``no_native`` - Work only in ACPI mode, no native mode available (ignore
+all custom tables).

 The value of the ``states_off`` module parameter (0 by default) represents a
 list of idle states to be disabled by default in the form of a bitmask.

MAINTAINERS

Lines changed: 5 additions & 3 deletions
@@ -11675,12 +11675,14 @@ F: Documentation/driver-api/crypto/iaa/iaa-crypto.rst
 F: drivers/crypto/intel/iaa/*

 INTEL IDLE DRIVER
-M: Jacob Pan <jacob.jun.pan@linux.intel.com>
-M: Len Brown <lenb@kernel.org>
+M: Rafael J. Wysocki <rafael@kernel.org>
+M: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
+M: Artem Bityutskiy <dedekind1@gmail.com>
+R: Len Brown <lenb@kernel.org>
 L: linux-pm@vger.kernel.org
 S: Supported
 B: https://bugzilla.kernel.org
-T: git git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux.git
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git
 F: drivers/idle/intel_idle.c

 INTEL IDXD DRIVER

drivers/cpuidle/cpuidle-arm.c

Lines changed: 4 additions & 4 deletions
@@ -137,17 +137,17 @@ static int __init arm_idle_init_cpu(int cpu)
 /*
  * arm_idle_init - Initializes arm cpuidle driver
  *
- * Initializes arm cpuidle driver for all CPUs, if any CPU fails
- * to register cpuidle driver then rollback to cancel all CPUs
- * registration.
+ * Initializes arm cpuidle driver for all present CPUs, if any
+ * CPU fails to register cpuidle driver then rollback to cancel
+ * all CPUs registration.
  */
 static int __init arm_idle_init(void)
 {
 	int cpu, ret;
 	struct cpuidle_driver *drv;
 	struct cpuidle_device *dev;

-	for_each_possible_cpu(cpu) {
+	for_each_present_cpu(cpu) {
 		ret = arm_idle_init_cpu(cpu);
 		if (ret)
 			goto out_fail;

drivers/cpuidle/cpuidle-big_little.c

Lines changed: 1 addition & 1 deletion
@@ -148,7 +148,7 @@ static int __init bl_idle_driver_init(struct cpuidle_driver *drv, int part_id)
 	if (!cpumask)
 		return -ENOMEM;

-	for_each_possible_cpu(cpu)
+	for_each_present_cpu(cpu)
 		if (smp_cpuid_part(cpu) == part_id)
 			cpumask_set_cpu(cpu, cpumask);

drivers/cpuidle/cpuidle-psci.c

Lines changed: 2 additions & 2 deletions
@@ -400,7 +400,7 @@ static int psci_idle_init_cpu(struct device *dev, int cpu)
 /*
  * psci_idle_probe - Initializes PSCI cpuidle driver
  *
- * Initializes PSCI cpuidle driver for all CPUs, if any CPU fails
+ * Initializes PSCI cpuidle driver for all present CPUs, if any CPU fails
  * to register cpuidle driver then rollback to cancel all CPUs
  * registration.
  */
@@ -410,7 +410,7 @@ static int psci_cpuidle_probe(struct platform_device *pdev)
 	struct cpuidle_driver *drv;
 	struct cpuidle_device *dev;

-	for_each_possible_cpu(cpu) {
+	for_each_present_cpu(cpu) {
 		ret = psci_idle_init_cpu(&pdev->dev, cpu);
 		if (ret)
 			goto out_fail;

drivers/cpuidle/cpuidle-qcom-spm.c

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ static int spm_cpuidle_drv_probe(struct platform_device *pdev)
 	if (ret)
 		return dev_err_probe(&pdev->dev, ret, "set warm boot addr failed");

-	for_each_possible_cpu(cpu) {
+	for_each_present_cpu(cpu) {
 		ret = spm_cpuidle_register(&pdev->dev, cpu);
 		if (ret && ret != -ENODEV) {
 			dev_err(&pdev->dev,

drivers/cpuidle/cpuidle-riscv-sbi.c

Lines changed: 2 additions & 2 deletions
@@ -529,8 +529,8 @@ static int sbi_cpuidle_probe(struct platform_device *pdev)
 		return ret;
 	}

-	/* Initialize CPU idle driver for each CPU */
-	for_each_possible_cpu(cpu) {
+	/* Initialize CPU idle driver for each present CPU */
+	for_each_present_cpu(cpu) {
 		ret = sbi_cpuidle_init_cpu(&pdev->dev, cpu);
 		if (ret) {
 			pr_debug("HART%ld: idle driver init failed\n",
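
The driver hunks above all swap the iteration set from possible CPUs to present CPUs. A toy userspace model of why that matters when nosmp or maxcpus=0 is used might look like the sketch below. Everything in it is hypothetical: the bitmask representation, the function names, and the assumption that per-CPU initialization fails for a CPU that is possible but not present. It does not use any kernel API.

```c
#define SKETCH_NR_CPUS 4

/*
 * Hypothetical per-CPU init: assumed to fail for a CPU that is not present.
 * This failure mode is an assumption made for the example.
 */
static int sketch_init_cpu(unsigned int present_mask, int cpu)
{
	return (present_mask & (1u << cpu)) ? 0 : -1;
}

/* Walk the CPUs set in iter_mask and stop at the first failure. */
static int sketch_init_all(unsigned int iter_mask, unsigned int present_mask)
{
	int cpu, ret;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (!(iter_mask & (1u << cpu)))
			continue;

		ret = sketch_init_cpu(present_mask, cpu);
		if (ret)
			return ret;
	}

	return 0;
}
```

In this model, with four possible CPUs (mask 0xF) and only the boot CPU present (mask 0x1), iterating over the possible mask fails at CPU 1, while iterating over the present mask succeeds.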

drivers/cpuidle/governors/menu.c

Lines changed: 67 additions & 62 deletions
@@ -41,7 +41,7 @@
  * the C state is required to actually break even on this cost. CPUIDLE
  * provides us this duration in the "target_residency" field. So all that we
  * need is a good prediction of how long we'll be idle. Like the traditional
- * menu governor, we start with the actual known "next timer event" time.
+ * menu governor, we take the actual known "next timer event" time.
  *
  * Since there are other source of wakeups (interrupts for example) than
  * the next timer event, this estimation is rather optimistic. To get a
@@ -50,30 +50,21 @@
  * duration always was 50% of the next timer tick, the correction factor will
  * be 0.5.
  *
- * menu uses a running average for this correction factor, however it uses a
- * set of factors, not just a single factor. This stems from the realization
- * that the ratio is dependent on the order of magnitude of the expected
- * duration; if we expect 500 milliseconds of idle time the likelihood of
- * getting an interrupt very early is much higher than if we expect 50 micro
- * seconds of idle time. A second independent factor that has big impact on
- * the actual factor is if there is (disk) IO outstanding or not.
- * (as a special twist, we consider every sleep longer than 50 milliseconds
- * as perfect; there are no power gains for sleeping longer than this)
- *
- * For these two reasons we keep an array of 12 independent factors, that gets
- * indexed based on the magnitude of the expected duration as well as the
- * "is IO outstanding" property.
+ * menu uses a running average for this correction factor, but it uses a set of
+ * factors, not just a single factor. This stems from the realization that the
+ * ratio is dependent on the order of magnitude of the expected duration; if we
+ * expect 500 milliseconds of idle time the likelihood of getting an interrupt
+ * very early is much higher than if we expect 50 micro seconds of idle time.
+ * For this reason, menu keeps an array of 6 independent factors, that gets
+ * indexed based on the magnitude of the expected duration.
  *
  * Repeatable-interval-detector
  * ----------------------------
  * There are some cases where "next timer" is a completely unusable predictor:
  * Those cases where the interval is fixed, for example due to hardware
- * interrupt mitigation, but also due to fixed transfer rate devices such as
- * mice.
+ * interrupt mitigation, but also due to fixed transfer rate devices like mice.
  * For this, we use a different predictor: We track the duration of the last 8
- * intervals and if the stand deviation of these 8 intervals is below a
- * threshold value, we use the average of these intervals as prediction.
- *
+ * intervals and use them to estimate the duration of the next one.
  */

 struct menu_device {
@@ -116,53 +107,52 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev);
  */
 static unsigned int get_typical_interval(struct menu_device *data)
 {
-	int i, divisor;
-	unsigned int min, max, thresh, avg;
-	uint64_t sum, variance;
-
-	thresh = INT_MAX; /* Discard outliers above this value */
+	s64 value, min_thresh = -1, max_thresh = UINT_MAX;
+	unsigned int max, min, divisor;
+	u64 avg, variance, avg_sq;
+	int i;

 again:
-
-	/* First calculate the average of past intervals */
-	min = UINT_MAX;
+	/* Compute the average and variance of past intervals. */
 	max = 0;
-	sum = 0;
+	min = UINT_MAX;
+	avg = 0;
+	variance = 0;
 	divisor = 0;
 	for (i = 0; i < INTERVALS; i++) {
-		unsigned int value = data->intervals[i];
-		if (value <= thresh) {
-			sum += value;
-			divisor++;
-			if (value > max)
-				max = value;
-
-			if (value < min)
-				min = value;
-		}
+		value = data->intervals[i];
+		/*
+		 * Discard the samples outside the interval between the min and
+		 * max thresholds.
+		 */
+		if (value <= min_thresh || value >= max_thresh)
+			continue;
+
+		divisor++;
+
+		avg += value;
+		variance += value * value;
+
+		if (value > max)
+			max = value;
+
+		if (value < min)
+			min = value;
 	}

 	if (!max)
 		return UINT_MAX;

-	if (divisor == INTERVALS)
-		avg = sum >> INTERVAL_SHIFT;
-	else
-		avg = div_u64(sum, divisor);
-
-	/* Then try to determine variance */
-	variance = 0;
-	for (i = 0; i < INTERVALS; i++) {
-		unsigned int value = data->intervals[i];
-		if (value <= thresh) {
-			int64_t diff = (int64_t)value - avg;
-			variance += diff * diff;
-		}
-	}
-	if (divisor == INTERVALS)
+	if (divisor == INTERVALS) {
+		avg >>= INTERVAL_SHIFT;
 		variance >>= INTERVAL_SHIFT;
-	else
+	} else {
+		do_div(avg, divisor);
 		do_div(variance, divisor);
+	}
+
+	avg_sq = avg * avg;
+	variance -= avg_sq;

 	/*
 	 * The typical interval is obtained when standard deviation is
@@ -177,25 +167,40 @@ static unsigned int get_typical_interval(struct menu_device *data)
 	 * Use this result only if there is no timer to wake us up sooner.
 	 */
 	if (likely(variance <= U64_MAX/36)) {
-		if ((((u64)avg*avg > variance*36) && (divisor * 4 >= INTERVALS * 3))
-							|| variance <= 400) {
+		if ((avg_sq > variance * 36 && divisor * 4 >= INTERVALS * 3) ||
+		    variance <= 400)
 			return avg;
-		}
 	}

 	/*
-	 * If we have outliers to the upside in our distribution, discard
-	 * those by setting the threshold to exclude these outliers, then
+	 * If there are outliers, discard them by setting thresholds to exclude
+	 * data points at a large enough distance from the average, then
 	 * calculate the average and standard deviation again. Once we get
-	 * down to the bottom 3/4 of our samples, stop excluding samples.
+	 * down to the last 3/4 of our samples, stop excluding samples.
 	 *
 	 * This can deal with workloads that have long pauses interspersed
 	 * with sporadic activity with a bunch of short pauses.
 	 */
-	if ((divisor * 4) <= INTERVALS * 3)
+	if (divisor * 4 <= INTERVALS * 3) {
+		/*
+		 * If there are sufficiently many data points still under
+		 * consideration after the outliers have been eliminated,
+		 * returning without a prediction would be a mistake because it
+		 * is likely that the next interval will not exceed the current
+		 * maximum, so return the latter in that case.
+		 */
+		if (divisor >= INTERVALS / 2)
+			return max;
+
 		return UINT_MAX;
+	}
+
+	/* Update the thresholds for the next round. */
+	if (avg - min > max - avg)
+		min_thresh = min;
+	else
+		max_thresh = max;

-	thresh = max - 1;
 	goto again;
 }
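
To experiment with the updated outlier handling outside the kernel, the logic of the new get_typical_interval() can be approximated in userspace roughly as below. This is a sketch under stated assumptions rather than the kernel code: the function name is made up, plain division stands in for the shift and do_div(), the samples are passed in as an array, and they are assumed to be small enough (well below 2^30) that the sums of squares fit in 64 bits. The variance comes from the single-pass identity "mean of squares minus square of the mean".

```c
#include <limits.h>
#include <stdint.h>

#define INTERVAL_SHIFT 3
#define INTERVALS (1U << INTERVAL_SHIFT)

/* Userspace approximation of the updated outlier elimination loop. */
static unsigned int typical_interval_sketch(const unsigned int samples[INTERVALS])
{
	int64_t min_thresh = -1, max_thresh = UINT_MAX;

	for (;;) {
		unsigned int max = 0, min = UINT_MAX, divisor = 0;
		uint64_t avg = 0, variance = 0, avg_sq;
		unsigned int i;

		for (i = 0; i < INTERVALS; i++) {
			int64_t value = samples[i];

			/* Skip samples at or beyond the current thresholds. */
			if (value <= min_thresh || value >= max_thresh)
				continue;

			divisor++;
			avg += (uint64_t)value;
			variance += (uint64_t)value * (uint64_t)value;

			if (value > max)
				max = (unsigned int)value;
			if (value < min)
				min = (unsigned int)value;
		}

		/* No usable samples (this also covers divisor == 0). */
		if (!max)
			return UINT_MAX;

		avg /= divisor;
		variance /= divisor;
		avg_sq = avg * avg;
		variance -= avg_sq;	/* E[x^2] - (E[x])^2 */

		if (variance <= UINT64_MAX / 36) {
			if ((avg_sq > variance * 36 &&
			     divisor * 4 >= INTERVALS * 3) || variance <= 400)
				return (unsigned int)avg;
		}

		/* Down to 3/4 of the samples: stop excluding data points. */
		if (divisor * 4 <= INTERVALS * 3)
			return divisor >= INTERVALS / 2 ? max : UINT_MAX;

		/* Drop whichever extreme is farther from the average. */
		if (avg - min > max - avg)
			min_thresh = min;
		else
			max_thresh = max;
	}
}
```

For example, seven samples of 100 plus one of 10000 yield 100 (the high outlier is dropped), and seven samples of 5000 plus one of 10 yield 5000 (the low outlier is dropped, which the old code could not do). For 1000, 2000, 3000, 4000, 5000, 6000, 100000, 200000 the two largest samples are dropped, the remaining six still vary too much, and the maximum of the remaining set, 6000, is returned.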
