Commit 07abb19

Merge tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:

 "From the functional perspective, the most significant change here is
  the addition of support for Energy Models that can be updated
  dynamically at run time.

  There is also the addition of LZ4 compression support for hibernation,
  the new preferred core support in amd-pstate, new platforms support in
  the Intel RAPL driver, new model-specific EPP handling in intel_pstate
  and more.

  Apart from that, the cpufreq default transition delay is reduced from
  10 ms to 2 ms (along with some related adjustments), the system suspend
  statistics code undergoes a significant rework and there is a usual
  bunch of fixes and code cleanups all over.

  Specifics:

   - Allow the Energy Model to be updated dynamically (Lukasz Luba)

   - Add support for LZ4 compression algorithm to the hibernation image
     creation and loading code (Nikhil V)

   - Fix and clean up system suspend statistics collection (Rafael
     Wysocki)

   - Simplify device suspend and resume handling in the power management
     core code (Rafael Wysocki)

   - Fix PCI hibernation support description (Yiwei Lin)

   - Make hibernation take set_memory_ro() return values into account as
     appropriate (Christophe Leroy)

   - Set mem_sleep_current during kernel command line setup to avoid an
     ordering issue with handling it (Maulik Shah)

   - Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
     driver's system suspend callback (Qingliang Li)

   - Simplify pm_runtime_get_if_active() usage and add a replacement for
     pm_runtime_put_autosuspend() (Sakari Ailus)

   - Add a tracepoint for runtime_status changes tracking (Vilas Bhat)

   - Fix section title markdown in the runtime PM documentation (Yiwei
     Lin)

   - Enable preferred core support in the amd-pstate cpufreq driver
     (Meng Li)

   - Fix min_perf assignment in amd_pstate_adjust_perf() and make the
     min/max limit perf values in amd-pstate always stay within the
     (highest perf, lowest perf) range (Tor Vic, Meng Li)

   - Allow intel_pstate to assign model-specific values to strings used
     in the EPP sysfs interface and make it do so on Meteor Lake
     (Srinivas Pandruvada)

   - Drop long-unused cpudata::prev_cummulative_iowait from the
     intel_pstate cpufreq driver (Jiri Slaby)

   - Prevent scaling_cur_freq from exceeding scaling_max_freq when the
     latter is an inefficient frequency (Shivnandan Kumar)

   - Change default transition delay in cpufreq to 2ms (Qais Yousef)

   - Remove references to 10ms minimum sampling rate from comments in
     the cpufreq code (Pierre Gondois)

   - Honour transition_latency over transition_delay_us in cpufreq (Qais
     Yousef)

   - Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar)

   - General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
     Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
     Belova)

   - Update cpufreq-dt-platdev to block/approve devices (Richard Acayan)

   - Make the SCMI cpufreq driver get a transition delay value from
     firmware (Pierre Gondois)

   - Prevent the haltpoll cpuidle governor from shrinking guest
     poll_limit_ns below grow_start (Parshuram Sangle)

   - Avoid potential overflow in integer multiplication when computing
     cpuidle state parameters (C Cheng)

   - Adjust MWAIT hint target C-state computation in the ACPI cpuidle
     driver and in intel_idle to return a correct value for C0 (He
     Rongguang)

   - Address multiple issues in the TPMI RAPL driver and add support for
     new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui)

   - Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
     Lezcano)

   - Fix kernel-doc for dtpm_create_hierarchy() (Yang Li)

   - Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
     Norway Ananda)

   - Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil)

   - Fix a couple of warnings in the OPP core code related to W=1 builds
     (Viresh Kumar)

   - Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh
     Kumar)

   - Extend dev_pm_opp_data with turbo support (Sibi Sankar)

   - dt-bindings: drop maxItems from inner items (David Heidelberg)"

* tag 'pm-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (95 commits)
  dt-bindings: opp: drop maxItems from inner items
  OPP: debugfs: Fix warning around icc_get_name()
  OPP: debugfs: Fix warning with W=1 builds
  cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h
  OPP: Extend dev_pm_opp_data with turbo support
  Fix cpupower-frequency-info.1 man page typo
  cpufreq: scmi: Set transition_delay_us
  firmware: arm_scmi: Populate fast channel rate_limit
  firmware: arm_scmi: Populate perf commands rate_limit
  cpuidle: ACPI/intel: fix MWAIT hint target C-state computation
  PM: sleep: wakeirq: fix wake irq warning in system suspend
  powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function
  cpufreq: Don't unregister cpufreq cooling on CPU hotplug
  PM: suspend: Set mem_sleep_current during kernel command line setup
  cpufreq: Honour transition_latency over transition_delay_us
  cpufreq: Limit resolving a frequency to policy min/max
  Documentation: PM: Fix runtime_pm.rst markdown syntax
  cpufreq: amd-pstate: adjust min/max limit perf
  cpufreq: Remove references to 10ms min sampling rate
  cpufreq: intel_pstate: Update default EPPs for Meteor Lake
  ...

2 parents a070a08 + 866b554 commit 07abb19

74 files changed, 2115 additions and 725 deletions

Documentation/admin-guide/kernel-parameters.txt

Lines changed: 16 additions & 0 deletions
@@ -374,6 +374,11 @@
 			selects a performance level in this range and appropriate
 			to the current workload.

+	amd_prefcore=
+			[X86]
+			disable
+			  Disable amd-pstate preferred core.
+
 	amijoy.map=	[HW,JOY] Amiga joystick support
 			Map of devices attached to JOY0DAT and JOY1DAT
 			Format: <a>,<b>
@@ -1760,6 +1765,17 @@
 			(that will set all pages holding image data
 			during restoration read-only).

+	hibernate.compressor=	[HIBERNATION] Compression algorithm to be
+				used with hibernation.
+				Format: { lzo | lz4 }
+				Default: lzo
+
+				lzo: Select LZO compression algorithm to
+				compress/decompress hibernation image.
+
+				lz4: Select LZ4 compression algorithm to
+				compress/decompress hibernation image.
+
 	highmem=nn[KMG]	[KNL,BOOT,EARLY] forces the highmem zone to have an exact
 			size of <nn>. This works even on boxes that have no
 			highmem otherwise. This also works to reduce highmem
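
The two parameters added in this file, amd_prefcore= and hibernate.compressor=, can be checked from user space by inspecting the kernel command line. The following is a minimal Python sketch, not kernel code: the function name parse_pm_params() is hypothetical, and falling back to the documented default when an unknown compressor name is given is an assumption of this sketch rather than documented kernel behaviour.

```python
# Hypothetical helper: derive the effective settings of the two parameters
# documented above from a kernel command line string.
VALID_COMPRESSORS = ("lzo", "lz4")  # from "Format: { lzo | lz4 }"


def parse_pm_params(cmdline: str) -> dict:
    # Documented defaults: LZO compression, preferred core enabled.
    settings = {"hibernate.compressor": "lzo", "amd_prefcore": "enabled"}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "hibernate.compressor" and value in VALID_COMPRESSORS:
            settings[key] = value
        elif key == "amd_prefcore" and value == "disable":
            settings[key] = "disabled"
        # Assumption: anything else leaves the default untouched.
    return settings
```

On a running system the input would typically come from /proc/cmdline, for example parse_pm_params(open("/proc/cmdline").read()).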

Documentation/admin-guide/pm/amd-pstate.rst

Lines changed: 57 additions & 2 deletions
@@ -300,8 +300,8 @@ platforms. The AMD P-States mechanism is the more performance and energy
 efficiency frequency management method on AMD processors.


-AMD Pstate Driver Operation Modes
-=================================
+``amd-pstate`` Driver Operation Modes
+======================================

 ``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
 non-autonomous (passive) mode and guided autonomous (guided) mode.
@@ -353,6 +353,48 @@ is activated. In this mode, driver requests minimum and maximum performance
 level and the platform autonomously selects a performance level in this range
 and appropriate to the current workload.

+``amd-pstate`` Preferred Core
+=================================
+
+The core frequency is subjected to the process variation in semiconductors.
+Not all cores are able to reach the maximum frequency respecting the
+infrastructure limits. Consequently, AMD has redefined the concept of
+maximum frequency of a part. This means that a fraction of cores can reach
+maximum frequency. To find the best process scheduling policy for a given
+scenario, OS needs to know the core ordering informed by the platform through
+highest performance capability register of the CPPC interface.
+
+``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
+cores that can achieve a higher frequency with lower voltage. The preferred
+core rankings can dynamically change based on the workload, platform conditions,
+thermals and ageing.
+
+The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
+driver will also determine whether or not ``amd-pstate`` preferred core is
+supported by the platform.
+
+``amd-pstate`` driver will provide an initial core ordering when the system boots.
+The platform uses the CPPC interfaces to communicate the core ranking to the
+operating system and scheduler to make sure that OS is choosing the cores
+with highest performance firstly for scheduling the process. When ``amd-pstate``
+driver receives a message with the highest performance change, it will
+update the core ranking and set the cpu's priority.
+
+``amd-pstate`` Preferred Core Switch
+=====================================
+Kernel Parameters
+-----------------
+
+``amd-pstate`` preferred core has two states: enable and disable.
+Enable/disable states can be chosen by different kernel parameters.
+Default enable ``amd-pstate`` preferred core.
+
+``amd_prefcore=disable``
+
+For systems that support ``amd-pstate`` preferred core, the core rankings will
+always be advertised by the platform. But OS can choose to ignore that via the
+kernel parameter ``amd_prefcore=disable``.
+
 User Space Interface in ``sysfs`` - General
 ===========================================

@@ -385,6 +427,19 @@ control its functionality at the system level. They are located in the
 	to the operation mode represented by that string - or to be
 	unregistered in the "disable" case.

+``prefcore``
+	Preferred core state of the driver: "enabled" or "disabled".
+
+	"enabled"
+		Enable the ``amd-pstate`` preferred core.
+
+	"disabled"
+		Disable the ``amd-pstate`` preferred core.
+
+
+	This attribute is read-only to check the state of preferred core set
+	by the kernel parameter.
+
 ``cpupower`` tool support for ``amd-pstate``
 ===============================================
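
The read-only ``prefcore`` attribute described in this file can be queried from user space. The sketch below is illustrative only: the helper name prefcore_enabled() is hypothetical, and the default path is an assumption, since the hunk above does not show the directory that holds the global amd-pstate attributes.

```python
from pathlib import Path

# Assumed location of the attribute; the hunk above does not state it.
DEFAULT_PREFCORE_PATH = Path("/sys/devices/system/cpu/amd_pstate/prefcore")


def prefcore_enabled(path=DEFAULT_PREFCORE_PATH):
    """Return True for "enabled", False for "disabled", None otherwise."""
    try:
        state = Path(path).read_text().strip()
    except OSError:
        # Attribute missing or unreadable (e.g. driver not loaded).
        return None
    if state == "enabled":
        return True
    if state == "disabled":
        return False
    return None
```

Returning None for a missing file keeps the "not available" case separate from an explicit "disabled" state set through ``amd_prefcore=disable``.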

Documentation/devicetree/bindings/opp/opp-v2-base.yaml

Lines changed: 0 additions & 2 deletions
@@ -57,8 +57,6 @@ patternProperties:
           specific binding.
         minItems: 1
         maxItems: 32
-        items:
-          maxItems: 1

       opp-microvolt:
         description: |

Documentation/power/energy-model.rst

Lines changed: 179 additions & 4 deletions
@@ -71,6 +71,31 @@ whose performance is scaled together. Performance domains generally have a
 required to have the same micro-architecture. CPUs in different performance
 domains can have different micro-architectures.

+To better reflect power variation due to static power (leakage) the EM
+supports runtime modifications of the power values. The mechanism relies on
+RCU to free the modifiable EM perf_state table memory. Its user, the task
+scheduler, also uses RCU to access this memory. The EM framework provides
+API for allocating/freeing the new memory for the modifiable EM table.
+The old memory is freed automatically using RCU callback mechanism when there
+are no owners anymore for the given EM runtime table instance. This is tracked
+using kref mechanism. The device driver which provided the new EM at runtime,
+should call EM API to free it safely when it's no longer needed. The EM
+framework will handle the clean-up when it's possible.
+
+The kernel code which wants to modify the EM values is protected from concurrent
+access using a mutex. Therefore, the device driver code must run in sleeping
+context when it tries to modify the EM.
+
+With the runtime modifiable EM we switch from a 'single and during the entire
+runtime static EM' (system property) design to a 'single EM which can be
+changed during runtime according e.g. to the workload' (system and workload
+property) design.
+
+It is possible also to modify the CPU performance values for each EM's
+performance state. Thus, the full power and performance profile (which
+is an exponential curve) can be changed according e.g. to the workload
+or system property.
+

 2. Core APIs
 ------------
@@ -175,10 +200,82 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
 not provided for other type of devices.

 More details about the above APIs can be found in ``<linux/energy_model.h>``
-or in Section 2.4
+or in Section 2.5
+
+
+2.4 Runtime modifications
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Drivers willing to update the EM at runtime should use the following dedicated
+function to allocate a new instance of the modified EM. The API is listed
+below::
+
+  struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);
+
+This allows to allocate a structure which contains the new EM table with
+also RCU and kref needed by the EM framework. The 'struct em_perf_table'
+contains array 'struct em_perf_state state[]' which is a list of performance
+states in ascending order. That list must be populated by the device driver
+which wants to update the EM. The list of frequencies can be taken from
+existing EM (created during boot). The content in the 'struct em_perf_state'
+must be populated by the driver as well.
+
+This is the API which does the EM update, using RCU pointers swap::
+
+  int em_dev_update_perf_domain(struct device *dev,
+			struct em_perf_table __rcu *new_table);
+
+Drivers must provide a pointer to the allocated and initialized new EM
+'struct em_perf_table'. That new EM will be safely used inside the EM framework
+and will be visible to other sub-systems in the kernel (thermal, powercap).
+The main design goal for this API is to be fast and avoid extra calculations
+or memory allocations at runtime. When pre-computed EMs are available in the
+device driver, then it should be possible to simply re-use them with low
+performance overhead.
+
+In order to free the EM, provided earlier by the driver (e.g. when the module
+is unloaded), there is a need to call the API::
+
+  void em_table_free(struct em_perf_table __rcu *table);
+
+It will allow the EM framework to safely remove the memory, when there is
+no other sub-system using it, e.g. EAS.
+
+To use the power values in other sub-systems (like thermal, powercap) there is
+a need to call API which protects the reader and provide consistency of the EM
+table data::
+
+  struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd);
+
+It returns the 'struct em_perf_state' pointer which is an array of performance
+states in ascending order.
+This function must be called in the RCU read lock section (after the
+rcu_read_lock()). When the EM table is not needed anymore there is a need to
+call rcu_read_unlock(). In this way the EM safely uses the RCU read section
+and protects the users. It also allows the EM framework to manage the memory
+and free it. More details how to use it can be found in Section 3.2 in the
+example driver.
+
+There is dedicated API for device drivers to calculate em_perf_state::cost
+values::
+
+  int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
+                           int nr_states);
+
+These 'cost' values from EM are used in EAS. The new EM table should be passed
+together with the number of entries and device pointer. When the computation
+of the cost values is done properly the return value from the function is 0.
+The function takes care for right setting of inefficiency for each performance
+state as well. It updates em_perf_state::flags accordingly.
+Then such prepared new EM can be passed to the em_dev_update_perf_domain()
+function, which will allow to use it.
+
+More details about the above APIs can be found in ``<linux/energy_model.h>``
+or in Section 3.2 with an example code showing simple implementation of the
+updating mechanism in a device driver.


-2.4 Description details of this API
+2.5 Description details of this API
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 .. kernel-doc:: include/linux/energy_model.h
    :internal:
@@ -187,8 +284,11 @@ or in Section 2.4
    :export:


-3. Example driver
------------------
+3. Examples
+-----------
+
+3.1 Example driver with EM registration
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 The CPUFreq framework supports dedicated callback for registering
 the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
@@ -242,3 +342,78 @@ EM framework::
   39	static struct cpufreq_driver foo_cpufreq_driver = {
   40		.register_em = foo_cpufreq_register_em,
   41	};
+
+
+3.2 Example driver with EM modification
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This section provides a simple example of a thermal driver modifying the EM.
+The driver implements a foo_thermal_em_update() function. The driver is woken
+up periodically to check the temperature and modify the EM data::
+
+  -> drivers/soc/example/example_em_mod.c
+
+  01	static void foo_get_new_em(struct foo_context *ctx)
+  02	{
+  03		struct em_perf_table __rcu *em_table;
+  04		struct em_perf_state *table, *new_table;
+  05		struct device *dev = ctx->dev;
+  06		struct em_perf_domain *pd;
+  07		unsigned long freq;
+  08		int i, ret;
+  09
+  10		pd = em_pd_get(dev);
+  11		if (!pd)
+  12			return;
+  13
+  14		em_table = em_table_alloc(pd);
+  15		if (!em_table)
+  16			return;
+  17
+  18		new_table = em_table->state;
+  19
+  20		rcu_read_lock();
+  21		table = em_perf_state_from_pd(pd);
+  22		for (i = 0; i < pd->nr_perf_states; i++) {
+  23			freq = table[i].frequency;
+  24			foo_get_power_perf_values(dev, freq, &new_table[i]);
+  25		}
+  26		rcu_read_unlock();
+  27
+  28		/* Calculate 'cost' values for EAS */
+  29		ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
+  30		if (ret) {
+  31			dev_warn(dev, "EM: compute costs failed %d\n", ret);
+  32			em_table_free(em_table);
+  33			return;
+  34		}
+  35
+  36		ret = em_dev_update_perf_domain(dev, em_table);
+  37		if (ret) {
+  38			dev_warn(dev, "EM: update failed %d\n", ret);
+  39			em_table_free(em_table);
+  40			return;
+  41		}
+  42
+  43		/*
+  44		 * Since it's one-time-update drop the usage counter.
+  45		 * The EM framework will later free the table when needed.
+  46		 */
+  47		em_table_free(em_table);
+  48	}
+  49
+  50	/*
+  51	 * Function called periodically to check the temperature and
+  52	 * update the EM if needed
+  53	 */
+  54	static void foo_thermal_em_update(struct foo_context *ctx)
+  55	{
+  56		struct device *dev = ctx->dev;
+  57		int cpu;
+  58
+  59		ctx->temperature = foo_get_temp(dev, ctx);
+  60		if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
+  61			return;
+  62
+  63		foo_get_new_em(ctx);
+  64	}
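
Section 2.4 says that em_dev_compute_costs() fills in em_perf_state::cost and marks inefficient performance states, but it does not give the formula. The following user-space Python model shows one plausible way such a pass could work and is not the kernel implementation. The power-per-performance ratio, the scaling factor of 10, the top-down comparison, and the names PerfState and compute_costs() are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PerfState:
    frequency: int        # kHz
    power: int            # power at this state, arbitrary units
    performance: int      # performance delivered at this state
    cost: int = 0
    inefficient: bool = False


def compute_costs(states):
    """Model of a cost pass over states listed in ascending order.

    Assumption: cost is power per unit of performance (scaled by 10 for
    integer precision). A state is marked inefficient when some higher
    state is at least as cheap per unit of performance.
    """
    prev_cost = None
    for state in reversed(states):          # walk from the highest state down
        state.cost = (state.power * 10) // state.performance
        if prev_cost is not None and state.cost >= prev_cost:
            state.inefficient = True
        else:
            state.inefficient = False
            prev_cost = state.cost
    return states
```

With states of (performance, power) = (256, 120), (512, 150) and (1024, 400), this model yields costs 4, 2 and 3 and marks only the lowest state as inefficient, because the middle state delivers more performance at a lower cost per unit.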

Documentation/power/opp.rst

Lines changed: 1 addition & 1 deletion
@@ -305,7 +305,7 @@ dev_pm_opp_get_opp_count
 	 {
 		/* Do things */
 		num_available = dev_pm_opp_get_opp_count(dev);
-		speeds = kzalloc(sizeof(u32) * num_available, GFP_KERNEL);
+		speeds = kcalloc(num_available, sizeof(u32), GFP_KERNEL);
 		/* populate the table in increasing order */
 		freq = 0;
 		while (!IS_ERR(opp = dev_pm_opp_find_freq_ceil(dev, &freq))) {

Documentation/power/pci.rst

Lines changed: 1 addition & 1 deletion
@@ -625,7 +625,7 @@ The PCI subsystem-level callbacks they correspond to::
 	pci_pm_poweroff()
 	pci_pm_poweroff_noirq()

-work in analogy with pci_pm_suspend() and pci_pm_poweroff_noirq(), respectively,
+work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
 although they don't attempt to save the device's standard configuration
 registers.
