Commit 419dc40

Gregory Price authored and davejiang committed
cxl: docs/allocation/page-allocator
Document some interesting interactions that occur when exposing CXL
memory capacity to page allocator.

Signed-off-by: Gregory Price <gourry@gourry.net>
Link: https://patch.msgid.link/20250512162134.3596150-15-gourry@gourry.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
1 parent 78ab675 commit 419dc40

2 files changed: +86 -0 lines changed

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
.. SPDX-License-Identifier: GPL-2.0

==================
The Page Allocator
==================

The kernel page allocator services all general page allocation requests, such
as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
allocator based on the `Memory Zone` and `NUMA node` in which the capacity is
placed.

This section focuses on how these configurations affect the page allocator
(as of Linux v6.15) rather than on overall page allocator behavior.

15+
NUMA nodes and mempolicy
16+
========================
17+
Unless a task explicitly registers a mempolicy, the default memory policy
18+
of the linux kernel is to allocate memory from the `local NUMA node` first,
19+
and fall back to other nodes only if the local node is pressured.
20+
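The local-first behavior described above can be illustrated with a small model. This is a minimal sketch, not kernel code: the ``Node`` class, the ``allocate_default_policy`` function, and the page counts are hypothetical names and values chosen for illustration, and the real allocator decides based on watermarks and reclaim rather than a bare free-page check.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a NUMA node with a free-page counter."""
    node_id: int
    free_pages: int


def allocate_default_policy(nodes, local_id, fallback_order, npages):
    """Return the id of the node that serviced the request, or None.

    The local node is tried first; the other nodes are tried in
    fallback order only when the local node cannot satisfy the request.
    """
    by_id = {node.node_id: node for node in nodes}
    order = [local_id] + [nid for nid in fallback_order if nid != local_id]
    for nid in order:
        node = by_id[nid]
        if node.free_pages >= npages:
            node.free_pages -= npages
            return nid
    return None


# Node 0 models local DRAM, node 1 models a non-local CXL node.
nodes = [Node(node_id=0, free_pages=4), Node(node_id=1, free_pages=100)]
first = allocate_default_policy(nodes, 0, [0, 1], 4)   # local node has room
second = allocate_default_policy(nodes, 0, [0, 1], 4)  # local node is now empty
```

In this model ``first`` is serviced by node 0, and ``second`` spills over to node 1 once node 0 has no free pages left.
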
Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
with the CXL memory being non-local. Technically, however, it is possible
for a compute node to have no local DRAM, and for CXL memory to be the
`local` capacity for that compute node.


Memory Zones
============
CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

As of v6.15, the page allocator first attempts to service an allocation from
the highest available zone on the local node that is compatible with the
request.

An example of a `zone incompatibility` is attempting to service an allocation
marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
typically not migratable, and as a result can only be serviced from
:code:`ZONE_NORMAL` or lower.

Put simply, for allocations that are compatible with both zones, the page
allocator will prefer :code:`ZONE_MOVABLE` over :code:`ZONE_NORMAL` by
default, but if :code:`ZONE_MOVABLE` is depleted, it will fall back to
allocating from :code:`ZONE_NORMAL`.


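As a rough illustration of zone compatibility and the highest-zone-first preference, the sketch below models zone selection on a single node. It is a toy under stated assumptions: ``HIGHEST_ZONE``, ``candidate_zones``, and ``pick_zone`` are hypothetical names, and the real allocator derives the zone limit from the GFP flags and checks watermarks rather than a bare free-page count.

```python
# Zones listed from lowest to highest.
ZONES = ["ZONE_DMA", "ZONE_DMA32", "ZONE_NORMAL", "ZONE_MOVABLE"]

# Hypothetical table: the highest zone each request type may be serviced from.
HIGHEST_ZONE = {
    "GFP_KERNEL": "ZONE_NORMAL",             # kernel data is typically not migratable
    "GFP_HIGHUSER_MOVABLE": "ZONE_MOVABLE",  # movable user pages
}


def candidate_zones(gfp):
    """Return the zones compatible with a request, highest zone first."""
    limit = ZONES.index(HIGHEST_ZONE[gfp])
    return list(reversed(ZONES[: limit + 1]))


def pick_zone(free_pages, gfp):
    """Return the first compatible zone with free pages, or None."""
    for zone in candidate_zones(gfp):
        if free_pages.get(zone, 0) > 0:
            return zone
    return None


both_free = {"ZONE_NORMAL": 10, "ZONE_MOVABLE": 10}
movable_depleted = {"ZONE_NORMAL": 10, "ZONE_MOVABLE": 0}

kernel_zone = pick_zone(both_free, "GFP_KERNEL")
movable_zone = pick_zone(both_free, "GFP_HIGHUSER_MOVABLE")
fallback_zone = pick_zone(movable_depleted, "GFP_HIGHUSER_MOVABLE")
```

In this model the ``GFP_KERNEL`` request never considers ``ZONE_MOVABLE``, while the movable request uses ``ZONE_MOVABLE`` when it has free pages and falls back to ``ZONE_NORMAL`` once it is depleted.
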
Zone and Node Quirks
====================
Let's consider a configuration where the local DRAM capacity is largely onlined
into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
CXL capacity has the opposite configuration: all of it is onlined in
:code:`ZONE_MOVABLE`.

Under the default allocation policy, the page allocator will completely skip
:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
Linux v6.15, the page allocator does (approximately) the following: ::

  for (each zone in local_node):

    for (each node in fallback_order):

      attempt_allocation(gfp_flags);

Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
functionally unreachable for direct allocation. As a result, the only way
for CXL capacity to be used is via `demotion` in the reclaim path.

This ordering also means that if the DRAM node does have :code:`ZONE_MOVABLE`
capacity, then once that capacity is depleted, the page allocator will actually
prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages.

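Both quirks can be reproduced with a toy version of the zone-outer, node-inner walk. This is only a sketch of the approximate loop given in this section, not the kernel implementation: ``allocate_page``, the node names ``dram`` and ``cxl``, and the page counts are all hypothetical, and the request is assumed to be compatible with both zones.

```python
ZONE_ORDER = ["ZONE_MOVABLE", "ZONE_NORMAL"]  # highest zone first


def allocate_page(nodes, local, fallback_order):
    """Model the zone-outer, node-inner walk for a movable allocation.

    nodes maps a node name to a dict of {zone name: free pages}. Only
    zones that exist on the local node are walked, as in the pseudocode.
    Returns (node, zone) on success or None when nothing is reachable.
    """
    for zone in ZONE_ORDER:
        if zone not in nodes[local]:
            continue  # the local node has no such zone, so it is skipped
        for node in fallback_order:
            if nodes[node].get(zone, 0) > 0:
                nodes[node][zone] -= 1
                return node, zone
    return None


order = ["dram", "cxl"]

# Case 1: DRAM is all ZONE_NORMAL, CXL is all ZONE_MOVABLE.
case1 = {"dram": {"ZONE_NORMAL": 1}, "cxl": {"ZONE_MOVABLE": 8}}
a = allocate_page(case1, "dram", order)  # serviced from DRAM ZONE_NORMAL
b = allocate_page(case1, "dram", order)  # DRAM depleted, CXL is never tried

# Case 2: DRAM also has a small ZONE_MOVABLE.
case2 = {"dram": {"ZONE_MOVABLE": 1, "ZONE_NORMAL": 8}, "cxl": {"ZONE_MOVABLE": 8}}
c = allocate_page(case2, "dram", order)  # serviced from DRAM ZONE_MOVABLE
d = allocate_page(case2, "dram", order)  # CXL ZONE_MOVABLE wins over DRAM ZONE_NORMAL
```

In case 1 the second request fails even though the CXL node still has free pages, which mirrors the unreachable-node behavior. In case 2 the second request lands on CXL while DRAM :code:`ZONE_NORMAL` still has capacity.
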
We may wish to invert this priority in future Linux versions.

If `demotion` and `swap` are disabled, Linux will begin to experience OOM
crashes when the DRAM nodes are depleted. See the reclaim section for more
details.


CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` setting may be used
by containers to restrict which NUMA nodes the tasks in that container can
allocate from. Users may wish to utilize this in multi-tenant systems where
some tasks prefer not to use slower memory.

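The effect of such a restriction on the node fallback order can be sketched as a simple filter. This is an illustrative model only: ``parse_node_list`` and ``restrict_fallback`` are hypothetical helpers, the node numbering is made up, and the ``"0-1,3"`` style list is assumed to be the format in which the allowed nodes are expressed.

```python
def parse_node_list(text):
    """Parse a node list such as "0-1,3" into a set of node ids."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-")
            nodes.update(range(int(start), int(end) + 1))
        else:
            nodes.add(int(part))
    return nodes


def restrict_fallback(fallback_order, mems_allowed):
    """Drop nodes the container is not allowed to allocate from."""
    return [nid for nid in fallback_order if nid in mems_allowed]


# Hypothetical system: node 0 is DRAM, node 1 is CXL.
dram_only = restrict_fallback([0, 1], parse_node_list("0"))
both = restrict_fallback([0, 1], parse_node_list("0-1"))
```

A container limited to node 0 ends up with a fallback order that no longer contains the CXL node, so its tasks cannot be given CXL pages by direct allocation in this model.
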
In the reclaim section we'll discuss some limitations of this interface in
preventing demotions of shared data to CXL memory (if demotions are enabled).

Documentation/driver-api/cxl/index.rst

Lines changed: 1 addition & 0 deletions
@@ -45,5 +45,6 @@ that have impacts on each other. The docs here break up configurations steps.
    :caption: Memory Allocation
 
    allocation/dax
+   allocation/page-allocator
 
 .. only:: subproject and html
