
Commit f49040c

Merge branch 'for-6.15-console-suspend-api-cleanup' into for-linus
2 parents: c1aa3da + 72c96a2


1,320 files changed: +54,247 additions, -17,595 deletions


Documentation/ABI/testing/sysfs-kernel-livepatch

Lines changed: 9 additions & 0 deletions
@@ -55,6 +55,15 @@ Description:
 	An attribute which indicates whether the patch supports
 	atomic-replace.

+What: /sys/kernel/livepatch/<patch>/stack_order
+Date: Jan 2025
+KernelVersion: 6.14.0
+Description:
+	This attribute specifies the sequence in which live patch modules
+	are applied to the system. If multiple live patches modify the same
+	function, the implementation with the biggest 'stack_order' number
+	is used, unless a transition is currently in progress.
+
 What: /sys/kernel/livepatch/<patch>/<object>
 Date: Nov 2014
 KernelVersion: 3.19.0
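As a usage illustration (not part of the commit), the new attribute can be read
like any other livepatch sysfs file. A minimal C sketch follows, assuming a
hypothetical live patch module named 'my_patch' is currently loaded::

	#include <stdio.h>

	int main(void)
	{
		/* 'my_patch' is a hypothetical name; use a loaded live patch. */
		const char *path = "/sys/kernel/livepatch/my_patch/stack_order";
		FILE *f = fopen(path, "r");
		int order;

		if (!f) {
			perror("fopen");
			return 1;
		}
		if (fscanf(f, "%d", &order) == 1)
			printf("stack_order: %d\n", order);
		fclose(f);
		return 0;
	}
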
Documentation/accel/amdxdna/amdnpu.rst

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
.. SPDX-License-Identifier: GPL-2.0-only

.. include:: <isonum.txt>

=========
AMD NPU
=========

:Copyright: |copy| 2024 Advanced Micro Devices, Inc.
:Author: Sonal Santan <sonal.santan@amd.com>

Overview
========

AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APUs. The NPU enables efficient execution of Machine
Learning applications like CNN, LLM, etc. The NPU is based on the
`AMD XDNA Architecture`_ and is managed by the **amdxdna** driver.


Hardware Description
====================

AMD NPU consists of the following hardware components:

AMD XDNA Array
--------------

AMD XDNA Array comprises a 2D array of compute and memory tiles built with
`AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and 1
row of memory tiles. Each compute tile contains a VLIW processor with its own
dedicated program and data memory. The memory tile acts as L2 memory. The 2D
array can be partitioned at a column boundary, creating a spatially isolated
partition which can be bound to a workload context.

Each column also has dedicated DMA engines to move data between host DDR and
the memory tile.

AMD Phoenix and AMD Hawk Point client NPUs have a 4x5 topology, i.e., 4 rows of
compute tiles arranged into 5 columns. The AMD Strix Point client APU has a 4x8
topology, i.e., 4 rows of compute tiles arranged into 8 columns.

Shared L2 Memory
----------------

The single row of memory tiles creates a pool of software-managed on-chip L2
memory. DMA engines are used to move data between host DDR and the memory
tiles. AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2
memory. The AMD Strix Point NPU has a total of 4096 KB of L2 memory.

Microcontroller
---------------

A microcontroller runs the NPU Firmware, which is responsible for command
processing, XDNA Array partition setup, XDNA Array configuration, workload
context management and workload orchestration.

NPU Firmware uses a dedicated instance of an isolated non-privileged context
called ERT to service each workload context. ERT is also used to execute user
provided ``ctrlcode`` associated with the workload context.

NPU Firmware uses a single isolated privileged context called MERT to service
management commands from the amdxdna driver.

Mailboxes
---------

The microcontroller and the amdxdna driver use a privileged channel for
management tasks like setting up contexts, telemetry, queries, error handling,
setting up the user channel, etc. As mentioned before, privileged channel
requests are serviced by MERT. The privileged channel is bound to a single
mailbox.

The microcontroller and the amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.

PCIe EP
-------

The NPU is visible to the x86 host CPU as a PCIe device with multiple BARs and
some MSI-X interrupt vectors. The NPU uses a dedicated high-bandwidth SoC-level
fabric for reading or writing into host memory. Each instance of ERT gets its
own dedicated MSI-X interrupt. MERT gets a single MSI-X interrupt.

The number of PCIe BARs varies depending on the specific device. Based on their
functions, PCIe BARs can generally be categorized into the following types:

* PSP BAR: exposes the AMD PSP (Platform Security Processor) function
* SMU BAR: exposes the AMD SMU (System Management Unit) function
* SRAM BAR: exposes ring buffers for the mailbox
* Mailbox BAR: exposes the mailbox control registers (head, tail and ISR
  registers etc.)
* Public Register BAR: exposes public registers

On specific devices, the above-mentioned BAR types might be combined into a
single physical PCIe BAR, or a module might require two physical PCIe BARs to
be fully functional. For example:

* On the AMD Phoenix device, the PSP, SMU and Public Register BARs are on PCIe
  BAR index 0.
* On the AMD Strix Point device, the Mailbox and Public Register BARs are on
  PCIe BAR index 0. The PSP has some registers in PCIe BAR index 0 (Public
  Register BAR) and PCIe BAR index 4 (PSP BAR).

Process Isolation Hardware
--------------------------

As explained before, the XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. A spatial
partition is set up by the microcontroller programming the column isolation
registers. Each spatial partition is associated with a PASID, which is also
programmed by the microcontroller. Hence, multiple spatial partitions in the
NPU can make concurrent host accesses, each protected by its PASID.

The NPU FW itself uses microcontroller-MMU-enforced isolated contexts for
servicing user and privileged channel requests.

Mixed Spatial and Temporal Scheduling
=====================================

The AMD XDNA architecture supports mixed spatial and temporal (time-sharing)
scheduling of the 2D array. This means that spatial partitions may be set up
and torn down dynamically to accommodate various workloads. A *spatial*
partition may be *exclusively* bound to one workload context, while another
partition may be *temporally* shared by more than one workload context. The
microcontroller updates the PASID for a temporally shared partition to match
the context that is bound to the partition at any given moment.

Resource Solver
---------------

The Resource Solver component of the amdxdna driver manages the allocation
of the 2D array among various workloads. Every workload describes the number
of columns required to run its NPU binary in its metadata. The Resource Solver
component uses hints passed by the workload and its own heuristics to decide
the 2D array (re)partition strategy and the mapping of workloads for spatial
and temporal sharing of columns. The FW enforces the context-to-column(s)
resource binding decisions made by the Resource Solver.

AMD Phoenix and AMD Hawk Point client NPUs can support 6 concurrent workload
contexts. AMD Strix Point can support 16 concurrent workload contexts.

Application Binaries
====================

An NPU application workload consists of two separate binaries which are
generated by the NPU compiler.

1. AMD XDNA Array overlay, which is used to configure an NPU spatial partition.
   The overlay contains instructions for setting up the stream switch
   configuration and ELF for the compute tiles. The overlay is loaded on the
   spatial partition bound to the workload by the associated ERT instance.
   Refer to the
   `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details.

2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
   partition. ``ctrlcode`` is executed by the ERT running in protected mode on
   the microcontroller in the context of the workload. ``ctrlcode`` is made up
   of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the
   `AI Engine Run Time`_ for more details.


Special Host Buffers
====================

Per-context Instruction Buffer
------------------------------

Every workload context uses a host-resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The ``ctrlcode``
used by the workload is copied into this special memory. This buffer is
protected by PASID like all other input/output buffers used by that workload.
The instruction buffer is also mapped into the user space of the workload.

Global Privileged Buffer
------------------------

In addition, the driver also allocates a single buffer for maintenance tasks
like recording errors from MERT. This global buffer uses the global IOMMU
domain and is only accessible by MERT.

High-level Use Flow
===================

Here are the steps to run a workload on the AMD NPU (a minimal userspace
sketch follows this list):

1. Compile the workload into an overlay and a ``ctrlcode`` binary.
2. Userspace opens a context in the driver and provides the overlay.
3. The driver checks with the Resource Solver for provisioning a set of columns
   for the workload.
4. The driver then asks MERT to create a context on the device with the desired
   columns.
5. MERT then creates an instance of ERT. MERT also maps the Instruction Buffer
   into ERT memory.
6. Userspace then copies the ``ctrlcode`` to the Instruction Buffer.
7. Userspace then creates a command buffer with pointers to the input, output,
   and instruction buffers; it submits the command buffer to the driver and
   goes to sleep waiting for completion.
8. The driver sends the command over the Mailbox to ERT.
9. ERT *executes* the ``ctrlcode`` in the instruction buffer.
10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR while
    the AMD XDNA Array is running.
11. When ERT reaches the end of the ``ctrlcode``, it raises an MSI-X interrupt
    to send a completion signal to the driver, which then wakes up the waiting
    workload.

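To illustrate the userspace side of step 2, the minimal C sketch below only
opens the accel device node registered by the amdxdna driver; the device path
is an assumption (the NPU may not be accel0 on a given system), and the
remaining steps are summarized as comments rather than real ioctl calls, since
the authoritative interface is the driver's uapi header, not this sketch::

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Assumption: the NPU registered as the first accel device. */
		int fd = open("/dev/accel/accel0", O_RDWR);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/*
		 * Remaining steps (see the numbered list above, not shown here):
		 * create a hardware context for the overlay, allocate and map
		 * the instruction/input/output buffers, copy the ctrlcode into
		 * the instruction buffer, then submit a command buffer and wait
		 * for completion through the driver's ioctl interface.
		 */

		close(fd);
		return 0;
	}
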
Boot Flow
=========

The amdxdna driver uses the PSP to securely load the signed NPU FW and kick off
the boot of the NPU microcontroller. The amdxdna driver then waits for the
alive signal in a special location on BAR 0. The NPU is switched off during SoC
suspend and turned back on after resume, at which point the NPU FW is reloaded
and the handshake is performed again.


Userspace components
====================

Compiler
--------

Peano is an LLVM-based open-source compiler for the AMD XDNA Array compute
tiles, available at:
https://github.com/Xilinx/llvm-aie

The open-source IREE compiler supports graph compilation of ML models for the
AMD NPU and uses Peano underneath. It is available at:
https://github.com/nod-ai/iree-amd-aie

Usermode Driver (UMD)
---------------------

The open-source XRT runtime stack interfaces with the amdxdna kernel driver.
XRT can be found at:
https://github.com/Xilinx/XRT

The open-source XRT shim for the NPU can be found at:
https://github.com/amd/xdna-driver


DMA Operation
=============

DMA operation instructions are encoded in the ``ctrlcode`` as the
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA
operations between host DDR and L2 memory are carried out.


Error Handling
==============

When MERT detects an error in the AMD XDNA Array, it pauses execution for that
workload context and sends an asynchronous message to the driver over the
privileged channel. The driver then sends a buffer pointer to MERT to capture
the register states for the partition bound to the faulting workload context.
The driver then decodes the error by reading the contents of the buffer.


Telemetry
=========

MERT can report various kinds of telemetry information, such as:

* L1 interrupt counter
* DMA counter
* Deep Sleep counter
* etc.


References
==========

- `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- `Peano <https://github.com/Xilinx/llvm-aie>`_
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_

Documentation/accel/amdxdna/index.rst

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
.. SPDX-License-Identifier: GPL-2.0-only

=====================================
accel/amdxdna NPU driver
=====================================

The accel/amdxdna driver supports the AMD NPU (Neural Processing Unit).

.. toctree::

   amdnpu

Documentation/accel/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Compute Accelerators
    :maxdepth: 1

    introduction
+   amdxdna/index
    qaic/index

 .. only:: subproject and html

Documentation/admin-guide/cgroup-v2.rst

Lines changed: 51 additions & 7 deletions
@@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
      5-6. Device
      5-7. RDMA
        5-7-1. RDMA Interface Files
-     5-8. HugeTLB
-       5.8-1. HugeTLB Interface Files
-     5-9. Misc
-       5.9-1 Miscellaneous cgroup Interface Files
-       5.9-2 Migration and Ownership
-     5-10. Others
-       5-10-1. perf_event
+     5-8. DMEM
+     5-9. HugeTLB
+       5.9-1. HugeTLB Interface Files
+     5-10. Misc
+       5.10-1 Miscellaneous cgroup Interface Files
+       5.10-2 Migration and Ownership
+     5-11. Others
+       5-11-1. perf_event
      5-N. Non-normative information
        5-N-1. CPU controller root cgroup process behaviour
        5-N-2. IO controller root cgroup process behaviour
@@ -2626,6 +2627,49 @@ RDMA Interface Files
 	  mlx4_0 hca_handle=1 hca_object=20
 	  ocrdma1 hca_handle=1 hca_object=23

+DMEM
+----
+
+The "dmem" controller regulates the distribution and accounting of device
+memory regions. Because each memory region may have its own page size, which
+does not have to be equal to the system page size, the units are always bytes.
+
+DMEM Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+  dmem.max, dmem.min, dmem.low
+	A read-write nested-keyed file that exists for all cgroups except
+	the root and describes the currently configured resource limit for
+	a region.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 1073741824
+	  drm/0000:03:00.0/stolen max
+
+	The semantics are the same as for the memory cgroup controller, and
+	are calculated in the same way.
+
+  dmem.capacity
+	A read-only file that describes the maximum region capacity.
+	It only exists on the root cgroup. Not all memory can be
+	allocated by cgroups, as the kernel reserves some for
+	internal use.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 8514437120
+	  drm/0000:03:00.0/stolen 67108864
+
+  dmem.current
+	A read-only file that describes current resource usage.
+	It exists for all cgroups except the root.
+
+	An example for xe follows::
+
+	  drm/0000:03:00.0/vram0 12550144
+	  drm/0000:03:00.0/stolen 8650752
+
 HugeTLB
 -------
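As a usage illustration (not part of the commit), the sketch below caps a
hypothetical child cgroup's VRAM usage by writing a nested key to dmem.max,
reusing the xe region name from the examples above; the cgroup path and the
limit are assumptions, and the dmem controller must already be enabled for
that subtree::

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical cgroup; create it and enable "dmem" first. */
		const char *path = "/sys/fs/cgroup/gpu-jobs/dmem.max";
		FILE *f = fopen(path, "w");

		if (!f) {
			perror("fopen");
			return 1;
		}
		/* Limit the xe vram0 region to 1 GiB (units are always bytes). */
		fprintf(f, "drm/0000:03:00.0/vram0 1073741824\n");
		fclose(f);
		return 0;
	}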
