
Commit 58dfd95

Merge branch 'for-6.16/cxl-docs' into cxl-for-next
Detailed documentation for the entire CXL sub-system from platform, BIOS, to CXL driver, memory interface, memory hotplug, and others.
2 parents a223ce1 + dba600d commit 58dfd95

29 files changed (+3867, -26 lines)
Lines changed: 60 additions & 0 deletions
.. SPDX-License-Identifier: GPL-2.0

===========
DAX Devices
===========
CXL capacity exposed as a DAX device can be accessed directly via mmap.
Users may wish to use this interface mechanism to write their own userland
CXL allocator, or to manage shared or persistent memory regions across
multiple hosts.

If the capacity is shared across hosts or persistent, appropriate flushing
mechanisms must be employed unless the region supports Snoop Back-Invalidate.
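
For example, for persistent or cross-host shared mappings, writes can be
flushed from userspace with CPU cache-flush instructions. Below is a minimal
sketch, assuming an x86_64 CPU with :code:`clflushopt` support (compile with
:code:`-mclflushopt`); the 64-byte cache line size is an assumption to verify
against the target CPU: ::

  #include <stddef.h>
  #include <stdint.h>
  #include <immintrin.h>

  #define CACHELINE_SIZE 64  /* assumed; verify for the target CPU */

  /* Flush a range of mapped DAX memory, then fence to order the flushes */
  void flush_range(void *addr, size_t len)
  {
      uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE_SIZE - 1);
      uintptr_t end = (uintptr_t)addr + len;

      for (; p < end; p += CACHELINE_SIZE)
          _mm_clflushopt((void *)p);
      _mm_sfence();
  }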

Note that mappings must be aligned (size and base) to the DAX device's base
alignment, which is typically 2MB but may be configured larger.

::

  #include <stdio.h>
  #include <stdlib.h>
  #include <stdint.h>
  #include <inttypes.h>
  #include <sys/mman.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define DEVICE_PATH "/dev/dax0.0"                // Replace with your DAX device path
  #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024)  // 4GB

  int main(void)
  {
      int fd;
      void *mapped_addr;

      /* Open the DAX device */
      fd = open(DEVICE_PATH, O_RDWR);
      if (fd < 0) {
          perror("open");
          return -1;
      }

      /* Map the device into memory */
      mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
      if (mapped_addr == MAP_FAILED) {
          perror("mmap");
          close(fd);
          return -1;
      }

      printf("Mapped address: %p\n", mapped_addr);

      /* You can now access the device through the mapped address */
      uint64_t *ptr = (uint64_t *)mapped_addr;
      *ptr = 0x1234567890abcdef;  // Write a value to the device
      printf("Value at address %p: 0x%016" PRIx64 "\n", (void *)ptr, *ptr);

      /* Clean up */
      munmap(mapped_addr, DEVICE_SIZE);
      close(fd);
      return 0;
  }
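
The 4GB :code:`DEVICE_SIZE` above is a hard-coded assumption. The actual size
and base alignment of a device-DAX instance can instead be read from its sysfs
attributes. A minimal sketch, assuming the conventional device-DAX sysfs
layout under :code:`/sys/bus/dax/devices/dax0.0/`: ::

  #include <stdio.h>
  #include <stdlib.h>

  /* Read a single numeric sysfs attribute (e.g. "size" or "align") */
  unsigned long long read_dax_attr(const char *attr)
  {
      char path[256];
      unsigned long long val = 0;
      FILE *f;

      snprintf(path, sizeof(path),
               "/sys/bus/dax/devices/dax0.0/%s", attr);
      f = fopen(path, "r");
      if (!f) {
          perror("fopen");
          exit(1);
      }
      if (fscanf(f, "%llu", &val) != 1)
          val = 0;
      fclose(f);
      return val;
  }

  int main(void)
  {
      printf("size:  %llu\n", read_dax_attr("size"));
      printf("align: %llu\n", read_dax_attr("align"));
      return 0;
  }
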
Lines changed: 32 additions & 0 deletions
.. SPDX-License-Identifier: GPL-2.0

==========
Huge Pages
==========

Contiguous Memory Allocator
===========================
CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA,
as the NUMA node hosting that capacity will be `Online` at the time CMA
carves out contiguous capacity.

CXL Memory deferred to the CXL Driver for configuration cannot have its
capacity allocated by CMA, because the NUMA node hosting the capacity is
`Offline` at :code:`__init` time, when CMA carves out contiguous capacity.

HugeTLB
=======
Different huge page sizes have different eligibility requirements, depending
on how the backing CXL capacity was configured and onlined.

2MB Huge Pages
--------------
All CXL capacity, regardless of configuration time or memory zone, is
eligible for use as 2MB huge pages.

1GB Huge Pages
--------------
CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic
Page allocation.

CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB
Gigantic Page allocation.
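
As an illustration of consuming 2MB huge pages from userspace, the sketch
below maps one anonymous HugeTLB page. It assumes pages have already been
reserved in the 2MB pool (e.g. via
:code:`/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages`) and that the
task's mempolicy or cpuset determines whether the page lands on a CXL-backed
node: ::

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #ifndef MAP_HUGE_2MB
  #define MAP_HUGE_2MB (21 << 26)  /* log2(2MB) << MAP_HUGE_SHIFT */
  #endif

  #define HPAGE_SIZE (2UL * 1024 * 1024)

  int main(void)
  {
      /* Anonymous mapping backed by one 2MB HugeTLB page */
      void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
                     -1, 0);

      if (p == MAP_FAILED) {
          perror("mmap");  /* likely no pages reserved in the HugeTLB pool */
          return 1;
      }

      memset(p, 0, HPAGE_SIZE);  /* fault the huge page in */
      printf("2MB huge page mapped at %p\n", p);
      munmap(p, HPAGE_SIZE);
      return 0;
  }
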
Lines changed: 85 additions & 0 deletions
.. SPDX-License-Identifier: GPL-2.0

==================
The Page Allocator
==================

The kernel page allocator services all general page allocation requests, such
as :code:`kmalloc`. CXL configuration steps affect the behavior of the page
allocator based on the selected `Memory Zone` and `NUMA node` the capacity is
placed in.

This section mostly focuses on how these configurations affect the page
allocator (as of Linux v6.15) rather than the overall page allocator behavior.

NUMA nodes and mempolicy
========================
Unless a task explicitly registers a mempolicy, the default memory policy
of the Linux kernel is to allocate memory from the `local NUMA node` first,
and fall back to other nodes only if the local node is pressured.

Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes,
with the CXL memory being non-local. Technically, however, it is possible
for a compute node to have no local DRAM, and for CXL memory to be the
`local` capacity for that compute node.

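For tasks that want CXL capacity explicitly rather than by fallback, a
mempolicy can be registered. Below is a minimal sketch using the
:code:`set_mempolicy` syscall via libnuma's :code:`numaif.h` (link with
:code:`-lnuma`); treating node 1 as the CXL-backed node is an assumption: ::

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <numaif.h>  /* set_mempolicy, MPOL_PREFERRED */

  int main(void)
  {
      /* Assume the CXL capacity is exposed as NUMA node 1 */
      unsigned long nodemask = 1UL << 1;

      /* Prefer node 1; fall back to other nodes under pressure */
      if (set_mempolicy(MPOL_PREFERRED, &nodemask,
                        sizeof(nodemask) * 8) != 0) {
          perror("set_mempolicy");
          return 1;
      }

      /* Subsequent allocations now prefer the CXL node */
      void *buf = malloc(1 << 20);
      memset(buf, 0, 1 << 20);  /* fault pages in under the new policy */
      printf("allocated 1MB preferring node 1\n");
      free(buf);
      return 0;
  }
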
Memory Zones
============
CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`.

As of v6.15, the page allocator attempts to allocate from the highest
available and compatible ZONE for an allocation from the local node first.

An example of a `zone incompatibility` is attempting to service an allocation
marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are
typically not migratable, and as a result can only be serviced from
:code:`ZONE_NORMAL` or lower.

To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over
:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it
will fall back to allocating from :code:`ZONE_NORMAL`.

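Whether a given node actually has :code:`ZONE_MOVABLE` or :code:`ZONE_NORMAL`
capacity can be verified by scanning :code:`/proc/zoneinfo`. A minimal sketch
that prints just the per-node headers: ::

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      char line[256];
      FILE *f = fopen("/proc/zoneinfo", "r");

      if (!f) {
          perror("fopen");
          return 1;
      }

      /* Print the per-node headers, e.g. "Node 1, zone  Movable" */
      while (fgets(line, sizeof(line), f)) {
          if (strncmp(line, "Node", 4) == 0)
              fputs(line, stdout);
      }

      fclose(f);
      return 0;
  }
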
Zone and Node Quirks
====================
Let's consider a configuration where the local DRAM capacity is largely onlined
into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The
CXL capacity has the opposite configuration: all onlined in
:code:`ZONE_MOVABLE`.

Under the default allocation policy, the page allocator will completely skip
:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of
Linux v6.15, the page allocator does (approximately) the following: ::

  for (each zone in local_node):
    for (each node in fallback_order):
      attempt_allocation(gfp_flags);

Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is
functionally unreachable for direct allocation. As a result, the only way
for CXL capacity to be used is via `demotion` in the reclaim path.

This configuration also means that if the DRAM node has :code:`ZONE_MOVABLE`
capacity, then when that capacity is depleted, the page allocator will
actually prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL`
pages.

We may wish to invert this priority in future Linux versions.

If `demotion` and `swap` are disabled, Linux will begin to OOM-kill tasks
when the DRAM nodes are depleted. See the reclaim section for more details.

CGroups and CPUSets
===================
Finally, assuming CXL memory is reachable via the page allocator (i.e. onlined
in :code:`ZONE_NORMAL`), :code:`cpusets.mems_allowed` may be used by
containers to limit the accessibility of certain NUMA nodes for tasks in that
container. Users may wish to utilize this in multi-tenant systems where some
tasks prefer not to use slower memory.

In the reclaim section we'll discuss some limitations of this interface to
prevent demotions of shared data to CXL memory (if demotions are enabled).
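
As an illustration, restricting a cgroup to DRAM-only nodes amounts to
writing the allowed node list into the cpuset controller. A minimal sketch,
assuming cgroup v2 with the cpuset controller enabled, a hypothetical group
named `example`, and DRAM on node 0: ::

  #include <stdio.h>

  int main(void)
  {
      /* Hypothetical cgroup; requires the cpuset controller enabled */
      const char *path = "/sys/fs/cgroup/example/cpuset.mems";
      FILE *f = fopen(path, "w");

      if (!f) {
          perror("fopen");
          return 1;
      }

      /* Allow allocations only from NUMA node 0 (assumed DRAM-only) */
      fprintf(f, "0\n");
      fclose(f);
      return 0;
  }
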
Lines changed: 51 additions & 0 deletions
.. SPDX-License-Identifier: GPL-2.0

=======
Reclaim
=======
Another way CXL memory can be utilized *indirectly* is via the reclaim system
in :code:`mm/vmscan.c`. Reclaim is engaged when memory capacity on the system
becomes pressured based on global and cgroup-local `watermark` settings.

In this section we won't discuss the `watermark` configurations, just how CXL
memory can be consumed by the various pieces of the reclaim system.

Demotion
========
By default, the reclaim system will prefer swap (or zswap) when reclaiming
memory. Enabling :code:`/sys/kernel/mm/numa/demotion_enabled` will cause
vmscan to opportunistically prefer distant NUMA nodes over swap or zswap,
if capacity is available.

Demotion engages the :code:`mm/memory_tier.c` component to determine the
next demotion node, which is selected based on the :code:`HMAT` or
:code:`CDAT` performance data.

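Toggling demotion from userspace is a single sysfs write, as sketched below
(this assumes a kernel built with memory tiering support and requires root
privileges): ::

  #include <stdio.h>

  int main(void)
  {
      FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");

      if (!f) {
          perror("fopen");  /* requires root and a demotion-capable kernel */
          return 1;
      }

      fprintf(f, "1\n");  /* enable demotion; write "0" to disable */
      fclose(f);
      return 0;
  }
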
cpusets.mems_allowed quirk
--------------------------
In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed`
when migrating pages. As a result, if demotion is enabled, vmscan cannot
guarantee isolation of a container's memory from nodes not set in mems_allowed.

In Linux v6.XX and up, demotion does attempt to respect
:code:`cpusets.mems_allowed`; however, certain classes of shared memory
originally instantiated by another cgroup (such as common libraries, e.g.
libc) may still be demoted. As a result, the mems_allowed interface still
cannot provide perfect isolation from the remote nodes.

ZSwap and Node Preference
=========================
In Linux v6.15 and below, ZSwap allocates memory from the local node of the
processor for the new pages being compressed. Since pages being compressed
are typically cold, the result is that a cold page becomes promoted, only to
be demoted later as it ages off the LRU.

In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed
as the allocation target for the compression page. This helps prevent
thrashing.

Demotion with ZSwap
===================
When enabling both Demotion and ZSwap, you create a situation where ZSwap
will prefer the slowest form of CXL memory by default until that tier of
memory is exhausted.
Lines changed: 165 additions & 0 deletions
.. SPDX-License-Identifier: GPL-2.0

=====================
Devices and Protocols
=====================

The type of CXL device (Memory, Accelerator, etc.) dictates many configuration
steps. This section covers some basic background on device types and on-device
resources used by the platform and OS which impact configuration.

Protocols
=========

There are three core protocols to CXL. For the purpose of this documentation,
we will only discuss very high-level definitions, as the specific hardware
details are largely abstracted away from Linux. See the CXL specification
for more details.

CXL.io
------
The basic interaction protocol, similar to PCIe configuration mechanisms.
Typically used for initialization, configuration, and I/O access for anything
other than memory (CXL.mem) or cache (CXL.cache) operations.

The Linux CXL driver exposes access to .io functionality via the various sysfs
interfaces and the /dev/cxl/ devices (which expose direct access to device
mailboxes).

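Mailbox access through /dev/cxl/ can be exercised directly. Below is a
minimal sketch that queries which mailbox commands a device supports,
assuming a device node at the hypothetical path :code:`/dev/cxl/mem0`;
the query-then-allocate pattern follows the UAPI comments in
:code:`linux/cxl_mem.h`: ::

  #include <stdio.h>
  #include <stdlib.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/cxl_mem.h>

  int main(void)
  {
      struct cxl_mem_query_commands count = { .n_commands = 0 };
      struct cxl_mem_query_commands *query;
      unsigned int i;
      int fd;

      fd = open("/dev/cxl/mem0", O_RDWR);
      if (fd < 0) {
          perror("open");
          return 1;
      }

      /* First call with n_commands == 0 asks for the total count */
      if (ioctl(fd, CXL_MEM_QUERY_COMMANDS, &count) < 0) {
          perror("ioctl");
          close(fd);
          return 1;
      }

      /* Second call fills in the command info array */
      query = calloc(1, sizeof(*query) +
                        count.n_commands * sizeof(query->commands[0]));
      query->n_commands = count.n_commands;
      if (ioctl(fd, CXL_MEM_QUERY_COMMANDS, query) < 0) {
          perror("ioctl");
          free(query);
          close(fd);
          return 1;
      }

      for (i = 0; i < query->n_commands; i++)
          printf("mailbox command id %u supported\n", query->commands[i].id);

      free(query);
      close(fd);
      return 0;
  }
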
CXL.cache
---------
The mechanism by which a device may coherently access and cache host memory.

Largely transparent to Linux once configured.

CXL.mem
-------
The mechanism by which the CPU may coherently access and cache device memory.

Largely transparent to Linux once configured.

Device Types
============

Type-1
------

A Type-1 CXL device:

* Supports cxl.io and cxl.cache protocols
* Implements a fully coherent cache
* Allows Device-to-Host coherence and Host-to-Device snoops
* Does NOT have host-managed device memory (HDM)

A typical example of a type-1 device is a Smart NIC, which may want to
directly operate on host memory (via DMA) to store incoming packets. These
devices largely rely on CPU-attached memory.

Type-2
------

A Type-2 CXL Device:

* Supports cxl.io, cxl.cache, and cxl.mem protocols
* Optionally implements a coherent cache and Host-Managed Device Memory
* Is typically an accelerator device with high-bandwidth memory

The primary difference between a type-1 and type-2 device is the presence
of host-managed device memory, which allows the device to operate on a
local memory bank while the CPU still has coherent DMA to the same memory.

This allows things like GPUs to expose their memory via DAX devices or file
descriptors, allowing drivers and programs direct access to device memory
rather than using block-transfer semantics.

Type-3
------

A Type-3 CXL Device:

* Supports cxl.io and cxl.mem
* Implements Host-Managed Device Memory
* May provide either Volatile or Persistent memory capacity (or both)

A basic example of a type-3 device is a simple memory expander, whose
local memory capacity is exposed to the CPU for access directly via
basic coherent DMA.

Switch
------

A CXL switch is a device capable of routing any CXL (and by extension, PCIe)
protocol between upstream, downstream, or peer devices. Many devices, such
as Multi-Logical Devices, imply the presence of switching in some manner.

Logical Devices and Heads
-------------------------

A CXL device may present one or more "Logical Devices" to one or more hosts
(via physical "Heads").

A Single-Logical Device (SLD) is a device which presents a single device to
one or more heads.

A Multi-Logical Device (MLD) is a device which may present multiple devices
to one or more heads.

A Single-Headed Device exposes only a single physical connection.

A Multi-Headed Device exposes multiple physical connections.

MHSLD
~~~~~
A Multi-Headed Single-Logical Device (MHSLD) exposes a single logical
device to multiple heads which may be connected to one or more discrete
hosts. An example of this would be a simple memory pool which may be
statically configured (prior to boot) to expose portions of its memory
to Linux via :doc:`CEDT <../platform/acpi/cedt>`.

MHMLD
~~~~~
A Multi-Headed Multi-Logical Device (MHMLD) exposes multiple logical
devices to multiple heads which may be connected to one or more discrete
hosts. An example of this would be a Dynamic Capacity Device, which
may be configured at runtime to expose portions of its memory to Linux.

Example Devices
===============

Memory Expander
---------------
The simplest form of Type-3 device is a memory expander. A memory expander
exposes Host-Managed Device Memory (HDM) to Linux. This memory may be
Volatile or Non-Volatile (Persistent).

A Memory Expander will typically be considered a form of Single-Headed,
Single-Logical Device, as its form factor will typically be an add-in card
(AIC) or some other similar form factor.

The Linux CXL driver provides support for static or dynamic configuration of
basic memory expanders. The platform may program decoders prior to OS init
(e.g. auto-decoders), or the user may program the fabric if the platform
defers these operations to the OS.

Multiple Memory Expanders may be added to an external chassis and exposed to
a host via a head attached to a CXL switch. This is a "memory pool", and
would be considered an MHSLD or MHMLD depending on the management capabilities
provided by the switch platform.

As of v6.14, Linux does not provide a formalized interface to manage non-DCD
MHSLD or MHMLD devices.

Dynamic Capacity Device (DCD)
-----------------------------

A Dynamic Capacity Device is a Type-3 device which provides dynamic management
of memory capacity. The basic premise of a DCD is to provide an allocator-like
interface for physical memory capacity to a "Fabric Manager" (an external,
privileged host that can change configurations for other hosts).

A DCD manages "Memory Extents", which may be volatile or persistent. Extents
may also be exclusive to a single host or shared across multiple hosts.

As of v6.14, Linux does not provide a formalized interface to manage DCD
devices; however, there is active work on LKML targeting a future release.
