
[RFC] CPU Frequency Scaling Subsystem #92123

Open
@seankyer

Description

Problem Description

Zephyr currently lacks support for CPU frequency scaling. As a result, SoCs that offer this feature cannot leverage it through the Zephyr kernel. Developers wishing to use CPU frequency scaling must instead resort to bare-metal APIs and implement this power management feature manually at the application level.
This introduces issues:

  • Portability: The developer's application becomes coupled to their SoC, violating Zephyr's promise of portability
  • Complexity: The developer must now manage the complexity of performance and power management themselves

Proposed Change (Summary)

Introducing a CPU Frequency Scaling (CPUFreq) subsystem to Zephyr will standardize performance management. This subsystem will shift the responsibility of managing CPU frequency from the application layer to the OS, improving portability and reducing complexity for developers wishing to leverage this feature of their SoC/MCU. This subsystem should behave similarly to the Power Management subsystem and transparently make performance decisions for the device running Zephyr, given a policy.

To support this, the following capabilities must be supported:

  • CPU Load Tracking: The kernel must track CPU utilization over time to inform performance decisions
  • Frequency Scaling Mechanism: The CPUFreq subsystem should evaluate the CPU load, select an appropriate performance state (P-state) according to a policy, and adjust the CPU frequency/voltage configuration accordingly (logic implemented by a driver)

This addition will enable Zephyr to support dynamic performance scaling on capable SoCs, enabling users to transparently make use of those features.

Proposed Change (Detailed)

Background
Many modern microcontrollers and systems-on-chip are equipped with dynamic voltage and frequency scaling (DVFS) to adjust their configuration depending on workload. This enables a balance between performance and energy efficiency.

Since Zephyr lacks the capability to leverage this feature, developers using Zephyr cannot take full advantage of their hardware, especially in power-sensitive applications such as:

  • Battery-powered IoT devices, where battery life is critical
  • Wearables and sensors, which may operate under temperature or energy constraints
  • Edge computing nodes, where workloads may be bursty, but power consumption is still of concern

The proposed implementation of this subsystem is based on prior art from Linux and its CPUFreq subsystem (CPU Performance Scaling — The Linux Kernel documentation).

Definitions
P-state: “Performance State”, a static CPU frequency/voltage configuration defined by the SoC
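For illustration, such a P-state could be carried in a small C struct. The field names below are assumptions that mirror the devicetree properties proposed later in this RFC, not a settled API:

```c
#include <stdint.h>

/* Hypothetical P-state descriptor (field names are illustrative only) */
struct p_state {
	const char *name;          /* e.g. "turbo", "lp1" */
	uint8_t trigger_threshold; /* CPU load %, from devicetree */
	uint32_t frequency_hz;     /* core clock this state selects */
};

/* Example table an SoC might expose, ordered by ascending threshold */
static const struct p_state example_states[] = {
	{ .name = "lp1",   .trigger_threshold = 0,  .frequency_hz = 12000000 },
	{ .name = "run",   .trigger_threshold = 30, .frequency_hz = 48000000 },
	{ .name = "turbo", .trigger_threshold = 80, .frequency_hz = 96000000 },
};
```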

Overview
To support CPU Frequency Scaling, the following must be added:

  • CPU Load Detection: The Zephyr kernel must track the idle time and load of the CPU
  • CPU Frequency Scaling Subsystem: Scalable and portable framework for CPU Frequency Scaling
    • Method of Selecting Appropriate P-state: Policy algorithm(s) for determining appropriate P-state
    • Method of Switching into P-state: Infrastructure for implementing the SoC determined ‘CPUFreq’ drivers
    • Devicetree Definitions: An SoC must describe what P-states are available and what the characteristics of each is

The Linux kernel’s architecture for this feature contains three software components:

  • Core: Common code infrastructure for user space interfaces for all platforms that support CPU performance scaling
  • Scaling Governors: Algorithms to estimate the required CPU capacity. Each governor implements one scaling algorithm
  • Scaling Drivers: Talk to hardware and provide scaling governors with information on the available P-states and access to platform-specific hardware interfaces to change CPU P-state

These software layers allow for any policy algorithm to work with any scaling driver, decoupling the software feature from the hardware specific details. This proposal will follow the same architectural design, ensuring portability of the feature between SoCs. In the proposed system-level diagram, shown later, I will map the layers suggested in this proposal to their Linux equivalents (they are strikingly similar).

This proposal will use the ‘on-demand’ governor policy from Linux as the initial implemented policy. This subsystem is designed such that additional policies can be added in a modular manner in the future. On-demand is chosen as the initial policy due to the simplicity of ‘first fit’ which should generally provide effective frequency scaling in response to computational load.

CPU load detection
The Zephyr kernel currently supports CPU load calculation as part of the debug subsystem. CPU load is calculated by leveraging tracing hooks, which are called before and after a CPU goes into idle, like so:

[Figure: CPU load measured via tracing hooks on idle entry/exit]

CPU load also accounts for time spent in the interrupt context. By default, the counter uses k_cycle_get_32(); this can be replaced with another counter via devicetree if needed.

Reference: CPU load — Zephyr Project Documentation

As part of this feature, CPU load detection should be promoted to its own subsystem and decoupled from the debug subsystem, enabling it to be used in a production context. This subsystem can run as a workqueue, with its sampling period configured by the user via a Kconfig option, as explained in the next section.

Calculating CPU Load Periodically
CPU load cannot simply be calculated as (idle_time)/(k_uptime()), because the load would appear disproportionately low as the lifetime of the device (the denominator) increases, leaving the device unresponsive to bursts of activity. Instead, CPU load should be calculated on a periodic basis, considering the change in idle time since the last calculation occurred.

The current code for CPU load relies on the caller of the cpu_load_get() function to reset the count. So, when promoting this logic to its own subsystem, we must add behavior that evaluates the CPU load at a set frequency. This can be achieved by registering a workqueue job that runs cpu_load_get() and resets the load count at a set interval (tuned to the user's system requirements). A new API, cpu_load_get_latest(), will return the most recently calculated CPU load over the previous period. The CPU frequency subsystem will rely on this value to make performance-state transitions.

Calculating the CPU load on a rolling basis should leverage a workqueue whose job runs periodically at an interval configured by Kconfig. The CPU load % can then be interpreted as ‘CPU load over the past CONFIG_CPU_LOAD_PERIOD ms’. Users can modify this Kconfig to tune the responsiveness of the CPU load subsystem.

Pseudo code for CPU load subsystem:

cpu_load = 0;

/* API called by user or other services */
cpu_load_get_latest(bool reset)
{
    temp = cpu_load;
    if (reset) {
        cpu_load = 0;
    }
    return temp;
}

cpu_load_workqueue()
{
    while(true) {
        /* Read the CPU load accumulated by the tracing hooks,
         * resetting the underlying count once it has been read. */
        cpu_load = cpu_load_get(true);

        /* Sleep until it's time to calculate CPU load again */
        sleep(CONFIG_CPU_LOAD_PERIOD);
    }
}
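The arithmetic behind that loop can be isolated from the Zephyr plumbing. Below is a minimal plain-C sketch of the per-window percentage calculation; the function name is hypothetical and not part of the proposed API:

```c
#include <stdint.h>

/* Compute CPU load over one sampling window as a percentage.
 * busy_cycles:   cycles spent outside idle during the window
 * window_cycles: total cycles elapsed in the window
 * Because only the delta since the last reset is used, the result
 * reflects recent activity rather than the lifetime average.
 */
static uint32_t cpu_load_percent(uint32_t busy_cycles, uint32_t window_cycles)
{
	if (window_cycles == 0U) {
		return 0U;
	}
	/* 64-bit intermediate avoids overflow for large cycle counts */
	return (uint32_t)(((uint64_t)busy_cycles * 100U) / window_cycles);
}
```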

CPU Frequency Scaling Subsystem
CPUFreq will be implemented as a thread running every CONFIG_CPUFREQ_INTERVAL_MS. During each pass, CPUFreq will collect the current CPU load, determine the next P-state, set that P-state, and notify the kernel of the clock frequency change.

Pseudo code for the CPUFreq thread is as follows:

#include <zephyr/cpu_load.h>

/* P-state type, to hold Devicetree properties */
p_state state;

while(true) {
    /* Get the next P-state given registered policy algorithm */
    cpufreq_p_state_policy_next_state(get_cpu_load(), &state);

    /* Use SoC implemented state_set function to set P-state */
    cpufreq_state_set(state.name);

    sleep(CONFIG_CPUFREQ_INTERVAL_MS);
}

Default P-state Policy (based on on-demand governor from Linux)
The default P-state policy will be based on the on-demand governor policy from the Linux kernel, since it is the simplest to implement and requires the least information about a given P-state.

The on-demand policy simply iterates through the available P-states and chooses the first P-state that satisfies the current CPU load of the system. This relies on the proposed devicetree property ‘trigger-threshold’.

Note: The logic behind this policy is much like pm_policy_next_state()

cpufreq_p_state_policy_next_state(cpu_load)
{
   /* p_states is ordered lowest to highest threshold; walk from the
    * highest-threshold state down and return the first match */
   for (i = num_p_states - 1; i >= 0; i--) {
      state = &p_states[i];
      if (cpu_load >= state->threshold) {
         return state;
      }
   }
   /* Fall back to the lowest P-state */
   return &p_states[0];
}

The above loop can be constructed using the available states defined from the devicetree and will follow the same logic as is defined in pm.c.
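To make the selection concrete, here is a self-contained C sketch of the same first-fit walk, assuming the table is ordered by ascending trigger-threshold. The struct layout, names, and example values are illustrative, not the final API:

```c
#include <stddef.h>
#include <stdint.h>

struct p_state {
	const char *name;
	uint8_t trigger_threshold; /* CPU load % that enables this state */
};

/* Example table, ordered lowest to highest threshold, as the
 * devicetree would supply it */
static const struct p_state p_states[] = {
	{ "lp1",   0  },
	{ "run",   30 },
	{ "turbo", 80 },
};

/* Walk from the highest-threshold state down and return the first one
 * whose threshold the current load meets; fall back to the lowest state. */
static const struct p_state *policy_next_state(uint32_t cpu_load)
{
	size_t n = sizeof(p_states) / sizeof(p_states[0]);

	for (size_t i = n; i-- > 0; ) {
		if (cpu_load >= p_states[i].trigger_threshold) {
			return &p_states[i];
		}
	}
	return &p_states[0];
}
```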

Updating the P-state
When the next P-state is chosen, it will follow a similar path as the pm_state_set functionality of the power management subsystem. Each SoC will implement a p-state.c file, typically located under soc//<chip_family>/, which will facilitate the transition to the selected P-state and call the relevant HAL functions.

Example implementation of cpufreq_state_set() in p-state.c:

cpufreq_state_set(p_state state)
{
   /* C cannot switch on a string, so compare the state name directly */
   if (strcmp(state.name, "Turbo") == 0) {
      LOG_DBG("Activating Turbo P-state");
      /* Calls to SoC HAL to activate Turbo mode */
   } else {
      LOG_DBG("Unsupported P-state");
   }
}

P-state Devicetree Properties
In the SoC .dtsi files, each CPU should define the available P-states for the processor, as well as the CPU load threshold at which it makes sense for the P-state to be invoked.

Example devicetree property definition:

properties:
   name:
      type: string
      description: Name of the P-state
   trigger-threshold:
      type: int
      description: The CPU load % required to enable this P-state

By defining each P-state as a devicetree property, the SoC vendor can supply default threshold values to the subsystem, which may be overlayed by the developer.
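For illustration, the vendor-supplied defaults might be declared in the SoC .dtsi roughly as follows; the node names, structure, and threshold values here are assumptions based on the proposed binding, not a final layout:

```
cpu0: cpu@0 {
   performance-states {
      lp1: lp1 {
         name = "lp1";
         trigger-threshold = <10>;
      };
      turbo: turbo {
         name = "turbo";
         trigger-threshold = <80>;
      };
   };
};
```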

&turbo {
   status = "disabled";
};

&lp1 {
   trigger-threshold = <20>;
};

The above overlay snippet shows an example of a user disabling the turbo performance state and tuning the low-power state to trigger at a different threshold than the default.

System Level Diagram
Given the current proposal, the system flow diagram is as follows:

[Figure: CPUFreq system-level flow diagram]

Annotated is how each component of this system could be mapped to the Linux equivalent.

Linux and Zephyr References
CPU Performance Scaling — The Linux Kernel documentation
linux/drivers/cpufreq/cpufreq_governor.c at master · torvalds/linux
linux/drivers/cpufreq/cpufreq_ondemand.c at master · torvalds/linux
zephyr/soc/adi/max32/power.c at main · zephyrproject-rtos/zephyr
CPU load — Zephyr Project Documentation

Initial Hardware for Implementation
This feature can be developed initially for the MAX32655 which supports adjusting clock frequency at runtime.

Dependencies

Since this is a new feature, I am not aware of other areas of the project it could impact. CPUFreq may be affected by [RFC] Introduce Clock Management Subsystem (Clock Driver Based) by danieldegrasse · Pull Request #72102 · zephyrproject-rtos/zephyr, which may add support for dynamic clock frequency updates.

Concerns and Unresolved Questions

No response

Alternatives Considered

Since there exists previous art from the Linux kernel, it makes sense to base the Zephyr feature from there and modify it depending on what infrastructure already exists within the Zephyr tree.

I also considered hooking this feature directly into the Zephyr power management subsystem, as part of the PM hooks that are run in the idle thread. However, CPUFreq can be separated and used in isolation from power modes. Doing so makes the system more modular and the impact of this PR more isolated.

Metadata

Labels: Architecture Review (Discussion in the Architecture WG required), RFC (Request For Comments: want input from the community), area: Kernel
