Commit 3bd8346

Merge branch 'pm-em'
Merge Energy Model changes for 6.9-rc1:

 - Allow the Energy Model to be updated dynamically (Lukasz Luba).

* pm-em: (24 commits)
  PM: EM: Fix nr_states warnings in static checks
  Documentation: EM: Update with runtime modification design
  PM: EM: Add em_dev_compute_costs()
  PM: EM: Remove old table
  PM: EM: Change debugfs configuration to use runtime EM table data
  drivers/thermal/devfreq_cooling: Use new Energy Model interface
  drivers/thermal/cpufreq_cooling: Use new Energy Model interface
  powercap/dtpm_devfreq: Use new Energy Model interface to get table
  powercap/dtpm_cpu: Use new Energy Model interface to get table
  PM: EM: Optimize em_cpu_energy() and remove division
  PM: EM: Support late CPUs booting and capacity adjustment
  PM: EM: Add performance field to struct em_perf_state and optimize
  PM: EM: Add em_perf_state_from_pd() to get performance states table
  PM: EM: Introduce em_dev_update_perf_domain() for EM updates
  PM: EM: Add functions for memory allocations for new EM tables
  PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
  PM: EM: Introduce runtime modifiable table
  PM: EM: Split the allocation and initialization of the EM table
  PM: EM: Check if the get_cost() callback is present in em_compute_costs()
  PM: EM: Introduce em_compute_costs()
  ...
2 parents c907ab5 + 3a561ea commit 3bd8346

7 files changed

+821
-170
lines changed

Documentation/power/energy-model.rst

Lines changed: 179 additions & 4 deletions
@@ -71,6 +71,31 @@ whose performance is scaled together. Performance domains generally have a
7171
required to have the same micro-architecture. CPUs in different performance
7272
domains can have different micro-architectures.
7373

74+
To better reflect power variation due to static power (leakage) the EM
75+
supports runtime modifications of the power values. The mechanism relies on
76+
RCU to free the modifiable EM perf_state table memory. Its user, the task
77+
scheduler, also uses RCU to access this memory. The EM framework provides
78+
API for allocating/freeing the new memory for the modifiable EM table.
79+
The old memory is freed automatically using RCU callback mechanism when there
80+
are no owners anymore for the given EM runtime table instance. This is tracked
81+
using kref mechanism. The device driver which provided the new EM at runtime,
82+
should call EM API to free it safely when it's no longer needed. The EM
83+
framework will handle the clean-up when it's possible.
84+
85+
The kernel code which wants to modify the EM values is protected from concurrent
86+
access using a mutex. Therefore, the device driver code must run in sleeping
87+
context when it tries to modify the EM.
88+
89+
With the runtime modifiable EM we switch from a 'single and during the entire
90+
runtime static EM' (system property) design to a 'single EM which can be
91+
changed during runtime according e.g. to the workload' (system and workload
92+
property) design.
93+
94+
It is possible also to modify the CPU performance values for each EM's
95+
performance state. Thus, the full power and performance profile (which
96+
is an exponential curve) can be changed according e.g. to the workload
97+
or system property.
98+
7499

75100
2. Core APIs
76101
------------
@@ -175,10 +200,82 @@ CPUfreq governor is in use in case of CPU device. Currently this calculation is
175200
not provided for other type of devices.
176201

177202
More details about the above APIs can be found in ``<linux/energy_model.h>``
178-
or in Section 2.4
203+
or in Section 2.5
204+
205+
206+
2.4 Runtime modifications
207+
^^^^^^^^^^^^^^^^^^^^^^^^^
208+
209+
Drivers willing to update the EM at runtime should use the following dedicated
210+
function to allocate a new instance of the modified EM. The API is listed
211+
below::
212+
213+
struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);
214+
215+
This allows allocating a structure which contains the new EM table with
216+
also RCU and kref needed by the EM framework. The 'struct em_perf_table'
217+
contains array 'struct em_perf_state state[]' which is a list of performance
218+
states in ascending order. That list must be populated by the device driver
219+
which wants to update the EM. The list of frequencies can be taken from
220+
existing EM (created during boot). The content in the 'struct em_perf_state'
221+
must be populated by the driver as well.
222+
223+
This is the API which does the EM update, using RCU pointers swap::
224+
225+
int em_dev_update_perf_domain(struct device *dev,
226+
struct em_perf_table __rcu *new_table);
227+
228+
Drivers must provide a pointer to the allocated and initialized new EM
229+
'struct em_perf_table'. That new EM will be safely used inside the EM framework
230+
and will be visible to other sub-systems in the kernel (thermal, powercap).
231+
The main design goal for this API is to be fast and avoid extra calculations
232+
or memory allocations at runtime. When pre-computed EMs are available in the
233+
device driver, then it should be possible to simply re-use them with low
234+
performance overhead.
235+
236+
In order to free the EM, provided earlier by the driver (e.g. when the module
237+
is unloaded), there is a need to call the API::
238+
239+
void em_table_free(struct em_perf_table __rcu *table);
240+
241+
It will allow the EM framework to safely remove the memory, when there is
242+
no other sub-system using it, e.g. EAS.
243+
244+
To use the power values in other sub-systems (like thermal, powercap) there is
245+
a need to call an API which protects the reader and provides consistency of the EM
246+
table data::
247+
248+
struct em_perf_state *em_perf_state_from_pd(struct em_perf_domain *pd);
249+
250+
It returns the 'struct em_perf_state' pointer which is an array of performance
251+
states in ascending order.
252+
This function must be called in the RCU read lock section (after the
253+
rcu_read_lock()). When the EM table is not needed anymore there is a need to
254+
call rcu_read_unlock(). In this way the EM safely uses the RCU read section
255+
and protects the users. It also allows the EM framework to manage the memory
256+
and free it. More details on how to use it can be found in Section 3.2 in the
257+
example driver.
258+
259+
There is a dedicated API for device drivers to calculate em_perf_state::cost
260+
values::
261+
262+
int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
263+
int nr_states);
264+
265+
These 'cost' values from EM are used in EAS. The new EM table should be passed
266+
together with the number of entries and device pointer. When the computation
267+
of the cost values is done properly the return value from the function is 0.
268+
The function takes care of setting the inefficiency correctly for each performance
269+
state as well. It updates em_perf_state::flags accordingly.
270+
Then such prepared new EM can be passed to the em_dev_update_perf_domain()
271+
function, which puts it into use.
272+
273+
More details about the above APIs can be found in ``<linux/energy_model.h>``
274+
or in Section 3.2 with an example code showing simple implementation of the
275+
updating mechanism in a device driver.
179276

180277

181-
2.4 Description details of this API
278+
2.5 Description details of this API
182279
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
183280
.. kernel-doc:: include/linux/energy_model.h
184281
:internal:
@@ -187,8 +284,11 @@ or in Section 2.4
187284
:export:
188285

189286

190-
3. Example driver
191-
-----------------
287+
3. Examples
288+
-----------
289+
290+
3.1 Example driver with EM registration
291+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
192292

193293
The CPUFreq framework supports dedicated callback for registering
194294
the EM for a given CPU(s) 'policy' object: cpufreq_driver::register_em().
@@ -242,3 +342,78 @@ EM framework::
242342
39 static struct cpufreq_driver foo_cpufreq_driver = {
243343
40 .register_em = foo_cpufreq_register_em,
244344
41 };
345+
346+
347+
3.2 Example driver with EM modification
348+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
349+
350+
This section provides a simple example of a thermal driver modifying the EM.
351+
The driver implements a foo_thermal_em_update() function. The driver is woken
352+
up periodically to check the temperature and modify the EM data::
353+
354+
-> drivers/soc/example/example_em_mod.c
355+
356+
01 static void foo_get_new_em(struct foo_context *ctx)
357+
02 {
358+
03 struct em_perf_table __rcu *em_table;
359+
04 struct em_perf_state *table, *new_table;
360+
05 struct device *dev = ctx->dev;
361+
06 struct em_perf_domain *pd;
362+
07 unsigned long freq;
363+
08 int i, ret;
364+
09
365+
10 pd = em_pd_get(dev);
366+
11 if (!pd)
367+
12 return;
368+
13
369+
14 em_table = em_table_alloc(pd);
370+
15 if (!em_table)
371+
16 return;
372+
17
373+
18 new_table = em_table->state;
374+
19
375+
20 rcu_read_lock();
376+
21 table = em_perf_state_from_pd(pd);
377+
22 for (i = 0; i < pd->nr_perf_states; i++) {
378+
23 freq = table[i].frequency;
379+
24 foo_get_power_perf_values(dev, freq, &new_table[i]);
380+
25 }
381+
26 rcu_read_unlock();
382+
27
383+
28 /* Calculate 'cost' values for EAS */
384+
29 ret = em_dev_compute_costs(dev, new_table, pd->nr_perf_states);
385+
30 if (ret) {
386+
31 dev_warn(dev, "EM: compute costs failed %d\n", ret);
387+
32 em_table_free(em_table);
388+
33 return;
389+
34 }
390+
35
391+
36 ret = em_dev_update_perf_domain(dev, em_table);
392+
37 if (ret) {
393+
38 dev_warn(dev, "EM: update failed %d\n", ret);
394+
39 em_table_free(em_table);
395+
40 return;
396+
41 }
397+
42
398+
43 /*
399+
44 * Since it's one-time-update drop the usage counter.
400+
45 * The EM framework will later free the table when needed.
401+
46 */
402+
47 em_table_free(em_table);
403+
48 }
404+
49
405+
50 /*
406+
51 * Function called periodically to check the temperature and
407+
52 * update the EM if needed
408+
53 */
409+
54 static void foo_thermal_em_update(struct foo_context *ctx)
410+
55 {
411+
56 struct device *dev = ctx->dev;
412+
57 int cpu;
413+
58
414+
59 ctx->temperature = foo_get_temp(dev, ctx);
415+
60 if (ctx->temperature < FOO_EM_UPDATE_TEMP_THRESHOLD)
416+
61 return;
417+
62
418+
63 foo_get_new_em(ctx);
419+
64 }
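
The final call in foo_get_new_em() (lines 43 to 47 of the example above) drops the driver's reference to a table that is still in use. The following user-space sketch models why that is safe, under the assumption that the framework takes its own reference when a table is published and drops it when the table is replaced. It is not kernel code: a plain counter stands in for kref, there is no RCU grace period, and every model_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space model only; all names below are hypothetical. */
struct model_table {
	int refcount;	/* stands in for the kernel's struct kref */
	int freed;	/* set once the last reference is dropped */
};

struct model_pd {
	struct model_table *em_table;	/* table currently published */
};

/* Allocate a table; the caller (the driver) owns the first reference. */
static struct model_table *model_table_alloc(void)
{
	struct model_table *t = calloc(1, sizeof(*t));

	if (t)
		t->refcount = 1;
	return t;
}

/* Drop one reference; the last drop marks the table as released. */
static void model_table_put(struct model_table *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0)
		t->freed = 1;	/* the kernel would defer the real free via RCU */
}

/* Publish new_table: take a framework reference, swap, release the old one. */
static void model_update(struct model_pd *pd, struct model_table *new_table)
{
	struct model_table *old = pd->em_table;

	new_table->refcount++;
	pd->em_table = new_table;
	if (old)
		model_table_put(old);
}
```

In this model a table that was published and then released by its driver keeps a count of 1 and stays valid until a later update replaces it. The model only sets a flag instead of calling free(), so the state can be inspected afterwards.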

drivers/powercap/dtpm_cpu.c

Lines changed: 30 additions & 11 deletions
@@ -42,6 +42,7 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
4242
{
4343
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
4444
struct em_perf_domain *pd = em_cpu_get(dtpm_cpu->cpu);
45+
struct em_perf_state *table;
4546
struct cpumask cpus;
4647
unsigned long freq;
4748
u64 power;
@@ -50,20 +51,22 @@ static u64 set_pd_power_limit(struct dtpm *dtpm, u64 power_limit)
5051
cpumask_and(&cpus, cpu_online_mask, to_cpumask(pd->cpus));
5152
nr_cpus = cpumask_weight(&cpus);
5253

54+
rcu_read_lock();
55+
table = em_perf_state_from_pd(pd);
5356
for (i = 0; i < pd->nr_perf_states; i++) {
5457

55-
power = pd->table[i].power * nr_cpus;
58+
power = table[i].power * nr_cpus;
5659

5760
if (power > power_limit)
5861
break;
5962
}
6063

61-
freq = pd->table[i - 1].frequency;
64+
freq = table[i - 1].frequency;
65+
power_limit = table[i - 1].power * nr_cpus;
66+
rcu_read_unlock();
6267

6368
freq_qos_update_request(&dtpm_cpu->qos_req, freq);
6469

65-
power_limit = pd->table[i - 1].power * nr_cpus;
66-
6770
return power_limit;
6871
}
6972

@@ -87,9 +90,11 @@ static u64 scale_pd_power_uw(struct cpumask *pd_mask, u64 power)
8790
static u64 get_pd_power_uw(struct dtpm *dtpm)
8891
{
8992
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
93+
struct em_perf_state *table;
9094
struct em_perf_domain *pd;
9195
struct cpumask *pd_mask;
9296
unsigned long freq;
97+
u64 power = 0;
9398
int i;
9499

95100
pd = em_cpu_get(dtpm_cpu->cpu);
@@ -98,33 +103,43 @@ static u64 get_pd_power_uw(struct dtpm *dtpm)
98103

99104
freq = cpufreq_quick_get(dtpm_cpu->cpu);
100105

106+
rcu_read_lock();
107+
table = em_perf_state_from_pd(pd);
101108
for (i = 0; i < pd->nr_perf_states; i++) {
102109

103-
if (pd->table[i].frequency < freq)
110+
if (table[i].frequency < freq)
104111
continue;
105112

106-
return scale_pd_power_uw(pd_mask, pd->table[i].power);
113+
power = scale_pd_power_uw(pd_mask, table[i].power);
114+
break;
107115
}
116+
rcu_read_unlock();
108117

109-
return 0;
118+
return power;
110119
}
111120

112121
static int update_pd_power_uw(struct dtpm *dtpm)
113122
{
114123
struct dtpm_cpu *dtpm_cpu = to_dtpm_cpu(dtpm);
115124
struct em_perf_domain *em = em_cpu_get(dtpm_cpu->cpu);
125+
struct em_perf_state *table;
116126
struct cpumask cpus;
117127
int nr_cpus;
118128

119129
cpumask_and(&cpus, cpu_online_mask, to_cpumask(em->cpus));
120130
nr_cpus = cpumask_weight(&cpus);
121131

122-
dtpm->power_min = em->table[0].power;
132+
rcu_read_lock();
133+
table = em_perf_state_from_pd(em);
134+
135+
dtpm->power_min = table[0].power;
123136
dtpm->power_min *= nr_cpus;
124137

125-
dtpm->power_max = em->table[em->nr_perf_states - 1].power;
138+
dtpm->power_max = table[em->nr_perf_states - 1].power;
126139
dtpm->power_max *= nr_cpus;
127140

141+
rcu_read_unlock();
142+
128143
return 0;
129144
}
130145

@@ -143,7 +158,7 @@ static void pd_release(struct dtpm *dtpm)
143158

144159
cpufreq_cpu_put(policy);
145160
}
146-
161+
147162
kfree(dtpm_cpu);
148163
}
149164

@@ -180,6 +195,7 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
180195
{
181196
struct dtpm_cpu *dtpm_cpu;
182197
struct cpufreq_policy *policy;
198+
struct em_perf_state *table;
183199
struct em_perf_domain *pd;
184200
char name[CPUFREQ_NAME_LEN];
185201
int ret = -ENOMEM;
@@ -216,9 +232,12 @@ static int __dtpm_cpu_setup(int cpu, struct dtpm *parent)
216232
if (ret)
217233
goto out_kfree_dtpm_cpu;
218234

235+
rcu_read_lock();
236+
table = em_perf_state_from_pd(pd);
219237
ret = freq_qos_add_request(&policy->constraints,
220238
&dtpm_cpu->qos_req, FREQ_QOS_MAX,
221-
pd->table[pd->nr_perf_states - 1].frequency);
239+
table[pd->nr_perf_states - 1].frequency);
240+
rcu_read_unlock();
222241
if (ret < 0)
223242
goto out_dtpm_unregister;
224243
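
The two table walks in the dtpm_cpu.c changes above, set_pd_power_limit() and get_pd_power_uw(), can be tried outside the kernel with the sketch below. It is a simplified model under stated assumptions: struct model_state is a hypothetical stand-in for the two fields of struct em_perf_state used here, the RCU read section is omitted, and the fallback to index 0 is a choice made for this sketch rather than something the driver code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct em_perf_state used here. */
struct model_state {
	unsigned long frequency;
	unsigned long power;	/* per-CPU power of this state */
};

/*
 * Index of the highest state whose domain power (per-CPU power times
 * nr_cpus) does not exceed power_limit. States are in ascending order.
 * Falls back to index 0 when even the lowest state is above the limit;
 * that clamp is an assumption of this sketch.
 */
static size_t model_state_for_limit(const struct model_state *table,
				    size_t nr_states, unsigned int nr_cpus,
				    uint64_t power_limit)
{
	size_t i;

	assert(nr_states > 0);
	for (i = 0; i < nr_states; i++) {
		if ((uint64_t)table[i].power * nr_cpus > power_limit)
			break;
	}
	return i ? i - 1 : 0;
}

/*
 * Index of the first state whose frequency is at least freq,
 * or nr_states when no such state exists.
 */
static size_t model_state_for_freq(const struct model_state *table,
				   size_t nr_states, unsigned long freq)
{
	size_t i;

	for (i = 0; i < nr_states; i++) {
		if (table[i].frequency >= freq)
			break;
	}
	return i;
}
```

With three states of per-CPU power 100, 300 and 700 and two online CPUs, a limit of 1000 selects index 1, because the domain power of the third state (1400) exceeds the limit.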
