|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +======================= |
| 4 | +Linux Init (Early Boot) |
| 5 | +======================= |
| 6 | + |
| 7 | +Linux configuration is split into two major steps: Early-Boot and everything else. |
| 8 | + |
| 9 | +During early boot, Linux sets up immutable resources (such as NUMA nodes), while |
| 10 | +later operations include things like driver probe and memory hotplug. Linux may |
| 11 | +read EFI and ACPI information throughout this process to configure logical |
| 12 | +representations of the devices. |
| 13 | + |
| 14 | +During the Linux early boot stage (kernel functions marked with the __init |
| 15 | +annotation), the system takes the resources created by EFI/BIOS (ACPI tables) |
| 16 | +and turns them into resources that the kernel can consume. |
| 17 | + |
| 18 | + |
| 19 | +BIOS, Build and Boot Options |
| 20 | +============================ |
| 21 | + |
| 22 | +There are 4 pre-boot options that need to be considered during kernel build |
| 23 | +which dictate how memory will be managed by Linux during early boot. |
| 24 | + |
| 25 | +* EFI_MEMORY_SP |
| 26 | + |
| 27 | + * BIOS/EFI Option that dictates whether memory is SystemRAM or |
| 28 | + Specific Purpose. Specific Purpose memory will be deferred to |
| 29 | + drivers to manage and is not immediately exposed as SystemRAM. |
| 30 | + |
| 31 | +* CONFIG_EFI_SOFT_RESERVE |
| 32 | + |
| 33 | + * Linux Build config option that dictates whether the kernel supports |
| 34 | + Specific Purpose memory. |
| 35 | + |
| 36 | +* CONFIG_MHP_DEFAULT_ONLINE_TYPE |
| 37 | + |
| 38 | + * Linux Build config that dictates whether and how Specific Purpose memory |
| 39 | + converted to a dax device should be managed (left as DAX or onlined as |
| 40 | + SystemRAM in ZONE_NORMAL or ZONE_MOVABLE). |
| 41 | + |
| 42 | +* nosoftreserve |
| 43 | + |
| 44 | + * Linux kernel boot option that dictates whether Soft Reserve should be |
| 45 | + supported. Similar to CONFIG_EFI_SOFT_RESERVE. |
| 46 | + |
| 47 | +Memory Map Creation |
| 48 | +=================== |
| 49 | + |
| 50 | +While the kernel parses the EFI memory map, if :code:`Specific Purpose` memory |
| 51 | +is supported and detected, it will set this region aside as |
| 52 | +:code:`SOFT_RESERVED`. |
| 53 | + |
| 54 | +If :code:`EFI_MEMORY_SP=0`, :code:`CONFIG_EFI_SOFT_RESERVE=n`, or |
| 55 | +:code:`nosoftreserve=y`, Linux will default a CXL device memory region to |
| 56 | +SystemRAM. This will expose the memory to the kernel page allocator in |
| 57 | +:code:`ZONE_NORMAL`, making it available for use for most allocations (including |
| 58 | +:code:`struct page` and page tables). |
| 59 | + |
| 60 | +If :code:`Specific Purpose` is set and supported, :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE_*` |
| 61 | +dictates whether the memory is onlined by default (:code:`_OFFLINE` or |
| 62 | +:code:`_ONLINE_*`) and, if onlined, which zone it is placed in by default |
| 63 | +(:code:`_NORMAL` or :code:`_MOVABLE`). |
| 64 | + |
| 65 | +If placed in :code:`ZONE_MOVABLE`, the memory will not be available for most |
| 66 | +kernel allocations (such as :code:`struct page` or page tables). This may |
| 67 | +significantly impact performance depending on the memory capacity of the system. |
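
The combinations above can be sketched as a small decision function. This is a minimal illustrative model in plain C, not kernel code: the struct, enum, and function names are hypothetical, and the online-type values follow the simplified naming used in this section rather than the exact Kconfig symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this only models the text above. */
enum mhp_default { MHP_OFFLINE, MHP_ONLINE_NORMAL, MHP_ONLINE_MOVABLE };

enum cxl_mem_outcome {
    CXL_MEM_SYSRAM_AT_BOOT,  /* in the page allocator, ZONE_NORMAL */
    CXL_MEM_LEFT_OFFLINE,    /* soft reserved, driver-managed, not onlined */
    CXL_MEM_ONLINE_NORMAL,   /* soft reserved, onlined to ZONE_NORMAL */
    CXL_MEM_ONLINE_MOVABLE,  /* soft reserved, onlined to ZONE_MOVABLE */
};

struct boot_opts {
    bool efi_memory_sp;            /* firmware marked the range Specific Purpose */
    bool config_efi_soft_reserve;  /* CONFIG_EFI_SOFT_RESERVE=y */
    bool nosoftreserve;            /* nosoftreserve given at boot */
    enum mhp_default mhp_default;  /* CONFIG_MHP_DEFAULT_ONLINE_TYPE_* */
};

static enum cxl_mem_outcome classify_cxl_region(const struct boot_opts *o)
{
    assert(o != NULL);

    /* Any one of these disables soft reservation: plain SystemRAM. */
    if (!o->efi_memory_sp || !o->config_efi_soft_reserve || o->nosoftreserve)
        return CXL_MEM_SYSRAM_AT_BOOT;

    /* Soft reserved: the default online type decides what happens next. */
    switch (o->mhp_default) {
    case MHP_ONLINE_NORMAL:
        return CXL_MEM_ONLINE_NORMAL;
    case MHP_ONLINE_MOVABLE:
        return CXL_MEM_ONLINE_MOVABLE;
    default:
        return CXL_MEM_LEFT_OFFLINE;
    }
}
```

Calling :code:`classify_cxl_region` with all three soft-reserve conditions satisfied and :code:`MHP_ONLINE_MOVABLE` yields the :code:`ZONE_MOVABLE` outcome; setting :code:`nosoftreserve` alone is enough to fall back to SystemRAM at boot.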
| 68 | + |
| 69 | + |
| 70 | +NUMA Node Reservation |
| 71 | +===================== |
| 72 | + |
| 73 | +Linux refers to the proximity domains (:code:`PXM`) defined in the SRAT to |
| 74 | +create NUMA nodes in :code:`acpi_numa_init`. Typically, there is a 1:1 relation |
| 75 | +between :code:`PXM` and NUMA node IDs. |
| 76 | + |
| 77 | +SRAT is the only ACPI-defined way of defining Proximity Domains. Linux chooses |
| 78 | +to, at most, map those 1:1 with NUMA nodes. CEDT adds a description of SPA |
| 79 | +ranges which Linux may wish to map to one or more NUMA nodes. |
| 80 | + |
| 81 | +If there are CXL ranges in the CFMWS but not in SRAT, then a fake :code:`PXM` |
| 82 | +is created (as of v6.15). In the future, Linux may reject CFMWS not described |
| 83 | +by SRAT due to the ambiguity of proximity domain association. |
| 84 | + |
| 85 | +It is important to note that NUMA node creation cannot be done at runtime. All |
| 86 | +possible NUMA nodes are identified at :code:`__init` time, more specifically |
| 87 | +during :code:`mm_init`. The CEDT and SRAT must contain sufficient :code:`PXM` |
| 88 | +data for Linux to identify NUMA nodes and their associated memory regions. |
| 89 | + |
| 90 | +The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`. |
| 91 | + |
| 92 | +See the Example Platform Configurations section for more information. |
| 93 | + |
| 94 | +Memory Tiers Creation |
| 95 | +===================== |
| 96 | +Memory tiers are a collection of NUMA nodes grouped by performance characteristics. |
| 97 | +During :code:`__init`, Linux initializes the system with a default memory tier that |
| 98 | +contains all nodes marked :code:`N_MEMORY`. |
| 99 | + |
| 100 | +:code:`memory_tier_init` is called at boot for all nodes with memory online by |
| 101 | +default. :code:`memory_tier_late_init` is called during late-init for nodes set up |
| 102 | +during driver configuration. |
| 103 | + |
| 104 | +Nodes are only marked :code:`N_MEMORY` if they have *online* memory. |
| 105 | + |
| 106 | +Tier membership can be inspected in :: |
| 107 | + |
| 108 | + /sys/devices/virtual/memory_tiering/memory_tierN/nodelist |
| 109 | + 0-1 |
| 110 | + |
| 111 | +If nodes with clearly different performance are grouped into the same tier, check |
| 112 | +the HMAT and CDAT information for the CXL nodes. All nodes default to the DRAM tier, |
| 113 | +unless HMAT/CDAT information is reported to the memory_tier component via |
| 114 | +:code:`access_coordinates`. |
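
When checking tier membership from a tool, the :code:`nodelist` contents (for example :code:`0-1`) have to be expanded into individual node IDs. The following is a minimal userspace sketch of such a parser. It assumes the list uses comma-separated IDs and inclusive ranges, and it caps node IDs at 63 so the result fits in one bitmask; the function name :code:`parse_nodelist` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/*
 * Expand a node list such as "0-1" or "0,2-3" into a bitmask where
 * bit N is set if node N is listed. Returns 0 on success and -1 on
 * malformed input or node IDs above 63. A trailing newline, as found
 * when reading sysfs files, is accepted.
 */
static int parse_nodelist(const char *s, uint64_t *mask)
{
    assert(s != NULL && mask != NULL);
    *mask = 0;

    while (*s != '\0' && *s != '\n') {
        char *end;
        unsigned long lo, hi;

        if (!is_digit(*s))
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;

        if (*end == '-') {              /* inclusive range "lo-hi" */
            s = end + 1;
            if (!is_digit(*s))
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (hi < lo || hi >= 64)
            return -1;

        for (unsigned long n = lo; n <= hi; n++)
            *mask |= UINT64_C(1) << n;

        s = end;
        if (*s == ',') {                /* another entry must follow */
            s++;
            if (!is_digit(*s))
                return -1;
        } else if (*s != '\0' && *s != '\n') {
            return -1;
        }
    }
    return 0;
}
```

A caller would read the file contents into a buffer and pass it to :code:`parse_nodelist`; a CXL node sharing a mask with the DRAM nodes indicates that it landed in the default tier.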
| 115 | + |
| 116 | +Contiguous Memory Allocation |
| 117 | +============================ |
| 118 | +The contiguous memory allocator (CMA) enables reservation of contiguous memory |
| 119 | +regions on NUMA nodes during early boot. However, CMA cannot reserve memory |
| 120 | +on NUMA nodes that are not online during early boot. :: |
| 121 | + |
| 122 | + void __init hugetlb_cma_reserve(int order) { |
| 123 | + if (!node_online(nid)) |
| 124 | + /* do not allow reservations */ |
| 125 | + } |
| 126 | + |
| 127 | +This means if users intend to defer management of CXL memory to the driver, CMA |
| 128 | +cannot be used to guarantee huge page allocations. If enabling CXL memory as |
| 129 | +SystemRAM in :code:`ZONE_NORMAL` during early boot, CMA reservations per-node can be |
| 130 | +made with the :code:`cma_pernuma` or :code:`numa_cma` kernel command line |
| 131 | +parameters. |
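
The effect of the online check on per-node requests can be illustrated with a small standalone model. This is a simplified sketch, not the kernel's implementation: :code:`MAX_NODES`, the array layout, and the function name are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8 /* hypothetical limit for this sketch */

/*
 * Drop per-node reservation requests that target nodes which are not
 * online, mirroring the node_online() check shown above. Returns the
 * total size that can still be reserved; rejected entries are zeroed.
 */
static unsigned long filter_cma_requests(unsigned long req[MAX_NODES],
                                         const bool online[MAX_NODES])
{
    unsigned long total = 0;

    assert(req != NULL && online != NULL);

    for (size_t nid = 0; nid < MAX_NODES; nid++) {
        if (req[nid] == 0)
            continue;           /* nothing requested on this node */
        if (!online[nid]) {
            req[nid] = 0;       /* node offline at early boot: reject */
            continue;
        }
        total += req[nid];
    }
    return total;
}
```

In this model, a request against a CXL node whose memory is deferred to the driver is dropped, because that node has no online memory when the reservation runs.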