| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +================== |
| 4 | +The Page Allocator |
| 5 | +================== |
| 6 | + |
| 7 | +The kernel page allocator services all general page allocation requests, such |
| 8 | +as those backing :code:`kmalloc`. CXL configuration steps affect the behavior of the page |
| 9 | +allocator based on the selected `Memory Zone` and `NUMA node` the capacity is |
| 10 | +placed in. |
| 11 | + |
| 12 | +This section mostly focuses on how these configurations affect the page |
| 13 | +allocator (as of Linux v6.15) rather than the overall page allocator behavior. |
| 14 | + |
| 15 | +NUMA nodes and mempolicy |
| 16 | +======================== |
| 17 | +Unless a task explicitly registers a mempolicy, the default memory policy |
| 18 | +of the Linux kernel is to allocate memory from the `local NUMA node` first, |
| 19 | +and fall back to other nodes only if the local node is under memory pressure. |
| 20 | + |
| 21 | +Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes, |
| 22 | +with the CXL memory being non-local. Technically, however, it is possible |
| 23 | +for a compute node to have no local DRAM, and for CXL memory to be the |
| 24 | +`local` capacity for that compute node. |
| 25 | + |
| 26 | + |
| 27 | +Memory Zones |
| 28 | +============ |
| 29 | +CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`. |
| 30 | + |
| 31 | +As of v6.15, the page allocator first attempts to allocate from the highest |
| 32 | +available zone on the local node that is compatible with the allocation. |
| 33 | + |
| 34 | +An example of a `zone incompatibility` is attempting to service an allocation |
| 35 | +marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are |
| 36 | +typically not migratable, and as a result can only be serviced from |
| 37 | +:code:`ZONE_NORMAL` or lower. |
| 38 | + |
| 39 | +Put simply, the page allocator will prefer :code:`ZONE_MOVABLE` over |
| 40 | +:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it |
| 41 | +will fall back to allocating from :code:`ZONE_NORMAL`. |
| 42 | + |
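The zone preference within a single node can be sketched as a small toy model. This is only an illustration of the ordering described above, not kernel code: the function name ``candidate_zones``, the ``movable_ok`` flag, and the four-zone list (a typical 64-bit layout) are assumptions made for the example.

```python
# Toy model of per-node zone preference; not kernel code.
# Zones from lowest to highest on a typical 64-bit build (assumption).
ZONES = ["ZONE_DMA", "ZONE_DMA32", "ZONE_NORMAL", "ZONE_MOVABLE"]


def candidate_zones(movable_ok, present):
    """Return the zones to try on one node, highest compatible zone first.

    movable_ok: True if the request's pages may be migrated later
                (False models a GFP_KERNEL-style request).
    present:    set of zone names that the node actually has.
    """
    highest = "ZONE_MOVABLE" if movable_ok else "ZONE_NORMAL"
    top = ZONES.index(highest)
    # Walk downward from the highest compatible zone, skipping absent zones.
    return [z for z in reversed(ZONES[: top + 1]) if z in present]


if __name__ == "__main__":
    # A kernel-style request never sees ZONE_MOVABLE.
    print(candidate_zones(False, set(ZONES)))
    # A movable request prefers ZONE_MOVABLE, then falls back to ZONE_NORMAL.
    print(candidate_zones(True, {"ZONE_NORMAL", "ZONE_MOVABLE"}))
```

In this model a node that only has :code:`ZONE_MOVABLE` offers nothing to a kernel-style request, which is the incompatibility described above.
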
| 43 | + |
| 44 | +Zone and Node Quirks |
| 45 | +==================== |
| 46 | +Let's consider a configuration where the local DRAM capacity is largely onlined |
| 47 | +into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The |
| 48 | +CXL capacity has the opposite configuration: all of it is onlined in |
| 49 | +:code:`ZONE_MOVABLE`. |
| 50 | + |
| 51 | +Under the default allocation policy, the page allocator will completely skip |
| 52 | +:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of |
| 53 | +Linux v6.15, the page allocator does (approximately) the following: :: |
| 54 | + |
| 55 | +  for (each zone in local_node): |
| 56 | + |
| 57 | +    for (each node in fallback_order): |
| 58 | + |
| 59 | +      attempt_allocation(gfp_flags); |
| 60 | + |
| 61 | +Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is |
| 62 | +functionally unreachable for direct allocation. As a result, the only way |
| 63 | +for CXL capacity to be used is via `demotion` in the reclaim path. |
| 64 | + |
| 65 | +This ordering also means that if the DRAM node does have :code:`ZONE_MOVABLE` |
| 66 | +capacity, then once that capacity is depleted the page allocator will actually |
| 67 | +prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages. |
| 68 | + |
| 69 | +We may wish to invert this priority in future Linux versions. |
| 70 | + |
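The two quirks above follow directly from the approximate zone-major loop, and a toy simulation makes them easier to see. The sketch below models only that approximation, not the kernel's actual allocation path. The names ``pick_target`` and ``free_pages``, the node labels, and the page counts are all hypothetical.

```python
# Toy model of the approximate zone-major search; not kernel code.
ZONE_ORDER = ["ZONE_MOVABLE", "ZONE_NORMAL"]  # highest zone first


def pick_target(free_pages, local_node, fallback_order, movable_ok=True):
    """Return the (node, zone) the approximate loop would pick, or None.

    free_pages maps node -> {zone: free page count}. A zone that a node
    does not have is absent from that node's dict; a depleted zone is
    present with a count of 0.
    """
    zones = ZONE_ORDER if movable_ok else ["ZONE_NORMAL"]
    for zone in zones:
        # Outer loop: only zones that exist on the local node are walked.
        if zone not in free_pages[local_node]:
            continue
        # Inner loop: try every node in fallback order for this zone.
        for node in fallback_order:
            if free_pages[node].get(zone, 0) > 0:
                return (node, zone)
    return None


if __name__ == "__main__":
    order = ["dram", "cxl"]

    # Quirk 1: DRAM has no ZONE_MOVABLE, CXL is all ZONE_MOVABLE.
    # Once DRAM ZONE_NORMAL is empty, the CXL node is never reached.
    quirk1 = {"dram": {"ZONE_NORMAL": 0}, "cxl": {"ZONE_MOVABLE": 100}}
    print(pick_target(quirk1, "dram", order))

    # Quirk 2: DRAM ZONE_MOVABLE exists but is depleted, so CXL
    # ZONE_MOVABLE is chosen ahead of DRAM ZONE_NORMAL.
    quirk2 = {
        "dram": {"ZONE_MOVABLE": 0, "ZONE_NORMAL": 100},
        "cxl": {"ZONE_MOVABLE": 100},
    }
    print(pick_target(quirk2, "dram", order))
```

In the first case the model returns nothing even though the CXL node has free pages; in the second it selects the CXL node while DRAM :code:`ZONE_NORMAL` still has capacity.
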
| 71 | +If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes |
| 72 | +when the DRAM nodes are depleted. See the reclaim section for more details. |
| 73 | + |
| 74 | + |
| 75 | +CGroups and CPUSets |
| 76 | +=================== |
| 77 | +Finally, assuming CXL memory is reachable by the page allocator (i.e. onlined |
| 78 | +in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by |
| 79 | +containers to limit the accessibility of certain NUMA nodes for tasks in that |
| 80 | +container. Users may wish to utilize this in multi-tenant systems where some |
| 81 | +tasks prefer not to use slower memory. |
| 82 | + |
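One way to check which NUMA nodes a task is currently allowed to allocate from is to read the ``Mems_allowed_list`` field of ``/proc/self/status``. The sketch below is a minimal example of that. The helper names ``parse_node_list`` and ``allowed_memory_nodes`` are hypothetical, and the parser assumes the usual kernel list format of comma-separated numbers and ranges (for example ``0-1,3``).

```python
def parse_node_list(text):
    """Parse a kernel list string such as "0-1,3" into a sorted list of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list or stray comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return sorted(nodes)


def allowed_memory_nodes(status_path="/proc/self/status"):
    """Return the NUMA nodes this task may allocate from, per procfs."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("Mems_allowed_list:"):
                return parse_node_list(line.split(":", 1)[1])
    return []


if __name__ == "__main__":
    print(parse_node_list("0-1,3"))
```

Running ``allowed_memory_nodes()`` inside and outside a restricted container shows whether the CXL node has been excluded for that task.
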
| 83 | +In the reclaim section we'll discuss some limitations of this interface in |
| 84 | +preventing demotions of shared data to CXL memory (if demotions are enabled). |
| 85 | + |