Commit bef826e

Gregory Price authored and davejiang committed
cxl: docs/linux - early boot configuration
Document __init time configurations that affect CXL driver probe process
and memory region configuration.

Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20250512162134.3596150-9-gourry@gourry.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
1 parent 9bd8546 commit bef826e

2 files changed, +132 -0 lines changed

Documentation/driver-api/cxl/index.rst

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ that have impacts on each other. The docs here break up configurations steps.
    :caption: Linux Kernel Configuration
 
    linux/overview
+   linux/early-boot
    linux/access-coordinates
 
 
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
.. SPDX-License-Identifier: GPL-2.0

=======================
Linux Init (Early Boot)
=======================

Linux configuration is split into two major steps: Early-Boot and everything else.

During early boot, Linux sets up immutable resources (such as NUMA nodes), while
later operations include things like driver probe and memory hotplug. Linux may
read EFI and ACPI information throughout this process to configure logical
representations of the devices.

During the Linux early boot stage (functions in the kernel that carry the
:code:`__init` annotation), the system takes the resources created by EFI/BIOS
(ACPI tables) and turns them into resources that the kernel can consume.


BIOS, Build and Boot Options
============================

Four options, set in the BIOS, at kernel build time, or on the kernel command
line, dictate how Linux manages memory during early boot.

* EFI_MEMORY_SP

  * BIOS/EFI option that dictates whether memory is SystemRAM or
    Specific Purpose. Specific Purpose memory is deferred to drivers
    to manage and is not immediately exposed as system RAM.

* CONFIG_EFI_SOFT_RESERVE

  * Linux build config option that dictates whether the kernel supports
    Specific Purpose memory.

* CONFIG_MHP_DEFAULT_ONLINE_TYPE

  * Linux build config option that dictates whether and how Specific Purpose
    memory converted to a dax device should be managed (left as DAX or onlined
    as SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).

* nosoftreserve

  * Linux kernel boot option that dictates whether Soft Reserve should be
    supported. Similar to CONFIG_EFI_SOFT_RESERVE.

Memory Map Creation
===================

While the kernel parses the EFI memory map, if :code:`Specific Purpose` memory
is supported and detected, it will set this region aside as
:code:`SOFT_RESERVED`.
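One way to check whether a region was set aside is to look for ranges labelled
:code:`Soft Reserved` in :code:`/proc/iomem`. The following is a minimal sketch
rather than a supported tool: the helper name :code:`soft_reserved_ranges` and
the sample text are illustrative assumptions, and real addresses are only shown
to privileged readers of :code:`/proc/iomem`.

```python
import re

# Matches /proc/iomem lines such as "  280000000-47fffffff : Soft Reserved".
IOMEM_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+?)\s*$")


def soft_reserved_ranges(iomem_text):
    """Return (start, end) address pairs whose label is "Soft Reserved".

    Hypothetical helper for illustration; it only parses text and does not
    touch the system.
    """
    ranges = []
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if match and match.group(3) == "Soft Reserved":
            ranges.append((int(match.group(1), 16), int(match.group(2), 16)))
    return ranges


# Made-up sample in the /proc/iomem layout; not taken from a real machine.
sample = """\
00001000-0009ffff : System RAM
100000000-27fffffff : System RAM
280000000-47fffffff : Soft Reserved
  280000000-47fffffff : dax0.0
"""

for start, end in soft_reserved_ranges(sample):
    print(f"{start:#x}-{end:#x} ({(end - start + 1) >> 20} MiB)")
```

On a live system the same function could be fed the contents of
:code:`/proc/iomem` instead of the sample string.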

If any of :code:`EFI_MEMORY_SP=0`, :code:`CONFIG_EFI_SOFT_RESERVE=n`, or
:code:`nosoftreserve=y` applies, Linux defaults a CXL device memory region to
SystemRAM. This exposes the memory to the kernel page allocator in
:code:`ZONE_NORMAL`, making it available for most allocations (including
:code:`struct page` and page tables).
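The interaction of the three switches can be summarised as a small truth
table. The sketch below is a model of the rule described above, not kernel
code; the function name :code:`default_cxl_region_state` and its boolean
parameters are assumptions introduced for illustration.

```python
from itertools import product


def default_cxl_region_state(efi_memory_sp, config_efi_soft_reserve, nosoftreserve):
    """Model the early-boot outcome for a CXL memory region.

    The booleans stand in for the BIOS attribute, the build option, and the
    boot option. Hypothetical helper; it only encodes the rule in the text.
    """
    if efi_memory_sp and config_efi_soft_reserve and not nosoftreserve:
        return "Soft Reserved"  # set aside for a driver to manage
    return "System RAM"  # handed to the page allocator at boot


# Print every combination so the single soft-reserved case stands out.
for sp, cfg, nosr in product([False, True], repeat=3):
    state = default_cxl_region_state(sp, cfg, nosr)
    print(f"EFI_MEMORY_SP={sp!s:5} SOFT_RESERVE={cfg!s:5} nosoftreserve={nosr!s:5} -> {state}")
```

Only one of the eight combinations leaves the region soft reserved; any single
switch in the "off" position is enough to expose the memory as SystemRAM.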

If :code:`Specific Purpose` is set and supported, :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE_*`
dictates whether the memory is onlined by default (:code:`_OFFLINE` or
:code:`_ONLINE_*`), and if online which zone to online this memory to by default
(:code:`_NORMAL` or :code:`_MOVABLE`).

If placed in :code:`ZONE_MOVABLE`, the memory will not be available for most
kernel allocations (such as :code:`struct page` or page tables). This may
significantly impact performance depending on the memory capacity of the system.
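The default online policy in effect on a running system can be read back from
the memory hotplug sysfs file :code:`auto_online_blocks`. The sketch below
assumes that this file reflects the build-time default unless it was overridden
at boot or at runtime; the helper names and the wording of the zone notes are
illustrative, not kernel terminology.

```python
# Values reported by auto_online_blocks, mapped to
# (onlined automatically?, note on the zone used).
ONLINE_POLICIES = {
    "offline": (False, None),
    "online": (True, "zone chosen by the kernel"),
    "online_kernel": (True, "kernel zone, typically ZONE_NORMAL"),
    "online_movable": (True, "ZONE_MOVABLE"),
}


def describe_online_policy(value):
    """Translate an auto_online_blocks value into (onlined, zone note)."""
    policy = value.strip()
    if policy not in ONLINE_POLICIES:
        raise ValueError(f"unexpected online policy: {policy!r}")
    return ONLINE_POLICIES[policy]


def read_online_policy(path="/sys/devices/system/memory/auto_online_blocks"):
    """Read and describe the policy from sysfs (needs memory hotplug support)."""
    with open(path) as handle:
        return describe_online_policy(handle.read())


# Demonstration on a literal value so the example runs without sysfs.
print(describe_online_policy("online_movable\n"))
```

A result of :code:`(False, None)` means hotplugged memory blocks stay offline
until userspace or a driver onlines them explicitly.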


NUMA Node Reservation
=====================

Linux refers to the proximity domains (:code:`PXM`) defined in the SRAT to
create NUMA nodes in :code:`acpi_numa_init`. Typically, there is a 1:1 relation
between :code:`PXM` and NUMA node IDs.

SRAT is the only ACPI-defined way of defining Proximity Domains. Linux chooses
to, at most, map those 1:1 with NUMA nodes. CEDT adds a description of SPA
ranges which Linux may wish to map to one or more NUMA nodes.

If there are CXL ranges in the CFMWS but not in SRAT, then a fake :code:`PXM`
is created (as of v6.15). In the future, Linux may reject CFMWS not described
by SRAT due to the ambiguity of proximity domain association.

NUMA nodes cannot be created at runtime. All possible NUMA nodes are identified
at :code:`__init` time, more specifically during :code:`mm_init`. The CEDT and
SRAT must contain sufficient :code:`PXM` data for Linux to identify NUMA nodes
and their associated memory regions.

The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`.

See the Example Platform Configurations section for more information.
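Because the set of possible nodes is fixed at boot, comparing the
:code:`possible` and :code:`online` node lists under
:code:`/sys/devices/system/node/` shows which nodes were reserved but are not
yet in use. The following is a minimal sketch under that assumption; both
helper names are hypothetical, and the inputs in the demonstration are made up.

```python
def parse_node_list(text):
    """Expand a kernel list string such as "0-1,3" into [0, 1, 3]."""
    nodes = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # tolerate an empty list
        first, _, last = chunk.partition("-")
        nodes.extend(range(int(first), int(last or first) + 1))
    return nodes


def offline_possible_nodes(possible_text, online_text):
    """Return the nodes that are possible but not currently online."""
    online = set(parse_node_list(online_text))
    return [node for node in parse_node_list(possible_text) if node not in online]


# Made-up contents of /sys/devices/system/node/possible and .../online.
print(offline_possible_nodes("0-3\n", "0-1\n"))
```

Nodes that appear only in the :code:`possible` list are candidates for memory
that a driver may online later, for example a CXL region described by the CEDT.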


Memory Tiers Creation
=====================

A memory tier is a collection of NUMA nodes grouped by performance characteristics.
During :code:`__init`, Linux initializes the system with a default memory tier that
contains all nodes marked :code:`N_MEMORY`.

:code:`memory_tier_init` is called at boot for all nodes with memory online by
default. :code:`memory_tier_late_init` is called during late-init for nodes set up
during driver configuration.

Nodes are only marked :code:`N_MEMORY` if they have *online* memory.

Tier membership can be inspected in ::

  /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
  0-1

If nodes with clearly different performance are grouped into the same tier,
check the HMAT and CDAT information for the CXL nodes. All nodes default to the
DRAM tier, unless HMAT/CDAT information is reported to the memory_tier component
via :code:`access_coordinates`.
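To see every tier at once, the :code:`nodelist` files described above can be
collected in one pass. This is a minimal sketch: the function name
:code:`read_memory_tiers` is an assumption, and the :code:`root` parameter
exists only so the code can be pointed at a test directory.

```python
import glob
import os


def read_memory_tiers(root="/sys/devices/virtual/memory_tiering"):
    """Return {tier directory name: nodelist string} for each tier under root."""
    tiers = {}
    pattern = os.path.join(root, "memory_tier*", "nodelist")
    for nodelist in sorted(glob.glob(pattern)):
        tier = os.path.basename(os.path.dirname(nodelist))
        with open(nodelist) as handle:
            tiers[tier] = handle.read().strip()
    return tiers


# Prints nothing on systems without memory tiering support.
for tier, nodes in read_memory_tiers().items():
    print(f"{tier}: nodes {nodes}")
```

If DRAM and CXL nodes show up in the same tier, that is the situation in which
the HMAT and CDAT data are worth checking.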


Contiguous Memory Allocation
============================

The contiguous memory allocator (CMA) enables reservation of contiguous memory
regions on NUMA nodes during early boot. However, CMA cannot reserve memory
on NUMA nodes that are not online during early boot. ::

  void __init hugetlb_cma_reserve(int order) {
    if (!node_online(nid))
      /* do not allow reservations */
  }

This means if users intend to defer management of CXL memory to the driver, CMA
cannot be used to guarantee huge page allocations. If enabling CXL memory as
SystemRAM in :code:`ZONE_NORMAL` during early boot, CMA reservations per-node can be
made with the :code:`cma_pernuma` or :code:`numa_cma` kernel command line
parameters.
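To confirm what was requested on a given boot, the two parameters can be picked
out of the kernel command line. The sketch below assumes the
:code:`cma_pernuma=nn[MG]` and :code:`numa_cma=<node>:nn[MG][,<node>:nn[MG]]`
forms and handles only plain decimal sizes with an optional K, M or G suffix,
which is narrower than the kernel's own size parsing. The helper names are
illustrative.

```python
import re

SIZE_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a size such as "512M" to bytes (simplified, decimal only)."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip(), flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * SIZE_SUFFIX[match.group(2).upper()]


def cma_requests(cmdline):
    """Return (cma_pernuma bytes or None, {node: bytes} from numa_cma)."""
    per_numa = None
    per_node = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "cma_pernuma":
            per_numa = parse_size(value)
        elif key == "numa_cma":
            for entry in value.split(","):
                node, _, size = entry.partition(":")
                per_node[int(node)] = parse_size(size)
    return per_numa, per_node


# Made-up command line; on a live system read /proc/cmdline instead.
print(cma_requests("root=/dev/sda1 numa_cma=0:1G,1:512M quiet"))
```

A node listed in :code:`numa_cma` only receives its reservation if that node is
online during early boot, which is the restriction described above.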
