Commit ae8371a

Merge tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:

 - Add infrastructure support to EDAC in order to be able to register
   memory scrubbing RAS functionality with the kernel and expose sysfs
   nodes to control such scrubbing functionality.

   The main use case is CXL devices which provide different scrubbers
   for their built-in memories so that tools like rasdaemon can
   configure and control memory scrubbing and other, more advanced RAS
   functionality (Shiju Jose and Jonathan Cameron)

 - Add support to ie31200_edac for client SoCs like Raptor Lake-S which
   have multiple memory controllers and out-of-band ECC capability
   (Qiuxu Zhuo)

 - The usual round of cleanups, simplifications and fixlets

* tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (25 commits)
  MAINTAINERS: Add a secondary maintainer for bluefield_edac
  EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
  EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
  EDAC/ie31200: Break up ie31200_probe1()
  EDAC/ie31200: Fold the two channel loops into one loop
  EDAC/ie31200: Make struct dimm_data contain decoded information
  EDAC/ie31200: Make the memory controller resources configurable
  EDAC/ie31200: Simplify the pci_device_id table
  EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
  EDAC/ie31200: Fix the error path order of ie31200_init()
  EDAC/ie31200: Fix the DIMM size mask for several SoCs
  EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
  EDAC/device: Fix dev_set_name() format string
  EDAC/pnd2: Make read-only const array intlv static
  EDAC/igen6: Constify struct res_config
  EDAC/amd64: Simplify return statement in dct_ecc_enabled()
  EDAC: Update memory repair control interface for memory sparing feature
  EDAC: Add a memory repair control feature
  EDAC: Use string choice helper functions
  EDAC: Add a Error Check Scrub control feature
  ...
2 parents 2899aa3 + 298ffd5 commit ae8371a

26 files changed: +2560 -314 lines changed
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
1+
What: /sys/bus/edac/devices/<dev-name>/ecs_fruX
2+
Date: March 2025
3+
KernelVersion: 6.15
4+
Contact: linux-edac@vger.kernel.org
5+
Description:
6+
The sysfs EDAC bus devices /<dev-name>/ecs_fruX subdirectory
7+
pertains to the memory media ECS (Error Check Scrub) control
8+
feature, where <dev-name> directory corresponds to a device
9+
registered with the EDAC device driver for the ECS feature.
10+
/ecs_fruX belongs to the media FRUs (Field Replaceable Unit)
11+
under the memory device.
12+
13+
The sysfs ECS attr nodes are only present if the parent
14+
driver has implemented the corresponding attr callback
15+
function and provided the necessary operations to the EDAC
16+
device driver during registration.
17+
18+
What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type
19+
Date: March 2025
20+
KernelVersion: 6.15
21+
Contact: linux-edac@vger.kernel.org
22+
Description:
23+
(RW) The log entry type of how the DDR5 ECS log is reported.
24+
25+
- 0 - per DRAM.
26+
27+
- 1 - per memory media FRU.
28+
29+
- All other values are reserved.
30+
31+
What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/mode
32+
Date: March 2025
33+
KernelVersion: 6.15
34+
Contact: linux-edac@vger.kernel.org
35+
Description:
36+
(RW) The mode of how the DDR5 ECS counts the errors.
37+
Error count is tracked based on two different modes
38+
selected by DDR5 ECS Control Feature - Codeword mode and
39+
Row Count mode. If the ECS is under Codeword mode, then
40+
the error count increments each time a codeword with check
41+
bit errors is detected. If the ECS is under Row Count mode,
42+
then the error counter increments each time a row with
43+
check bit errors is detected.
44+
45+
- 0 - ECS counts rows in the memory media that have ECC errors.
46+
47+
- 1 - ECS counts codewords with errors, specifically, it counts
48+
the number of ECC-detected errors in the memory media.
49+
50+
- All other values are reserved.
51+
52+
What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/reset
53+
Date: March 2025
54+
KernelVersion: 6.15
55+
Contact: linux-edac@vger.kernel.org
56+
Description:
57+
(WO) Reset the ECS ECC counter.
58+
59+
- 1 - reset ECC counter to the default value.
60+
61+
- All other values are reserved.
62+
63+
What: /sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold
64+
Date: March 2025
65+
KernelVersion: 6.15
66+
Contact: linux-edac@vger.kernel.org
67+
Description:
68+
(RW) DDR5 ECS threshold count per gigabit of memory cells.
69+
The ECS error count is subject to the ECS Threshold count
70+
per Gbit, which masks error counts less than the Threshold.
71+
72+
Supported values are 256, 1024 and 4096.
73+
74+
All other values are reserved.
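
The ECS entries above can be exercised from userspace roughly as follows. This is a minimal sketch, not part of the patch: the helper names `write_attr` and `configure_ecs` are hypothetical, the values are written as plain decimal text (an assumption about what the kernel accepts), and the caller supplies the path to one ecs_fruX directory.

```python
from pathlib import Path

# Values documented in the ABI entries above.
ECS_THRESHOLDS = (256, 1024, 4096)
ECS_MODES = {"row": 0, "codeword": 1}


def write_attr(directory: Path, name: str, value: int) -> bool:
    """Write an integer to a sysfs attribute; return False if it is absent."""
    node = directory / name
    if not node.is_file():
        # The parent driver did not implement this attribute callback.
        return False
    node.write_text(f"{value}\n")
    return True


def configure_ecs(fru_dir: Path, mode: str, threshold: int) -> dict[str, bool]:
    """Set the ECS counting mode and threshold for one ecs_fruX directory."""
    if mode not in ECS_MODES:
        raise ValueError(f"unknown ECS mode: {mode!r}")
    if threshold not in ECS_THRESHOLDS:
        raise ValueError(f"unsupported ECS threshold: {threshold}")
    # Report per attribute whether the write actually happened.
    return {
        "mode": write_attr(fru_dir, "mode", ECS_MODES[mode]),
        "threshold": write_attr(fru_dir, "threshold", threshold),
    }
```

Rejecting reserved values before touching sysfs keeps the error handling in one place; a missing attribute is reported rather than treated as fatal, because the nodes are optional.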
Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
1+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX
2+
Date: March 2025
3+
KernelVersion: 6.15
4+
Contact: linux-edac@vger.kernel.org
5+
Description:
6+
The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
7+
pertains to the memory media repair features control, such as
8+
PPR (Post Package Repair), memory sparing etc., where <dev-name>
9+
directory corresponds to a device registered with the EDAC
10+
device driver for the memory repair features.
11+
12+
Post Package Repair is a maintenance operation that requests the memory
13+
device to perform a repair operation on its media. It is a memory
14+
self-healing feature that fixes a failing memory location by
15+
replacing it with a spare row in a DRAM device. For example, a
16+
CXL memory device with DRAM components that support PPR features may
17+
implement PPR maintenance operations. DRAM components may support
18+
two types of PPR functions: hard PPR, for a permanent row repair, and
19+
soft PPR, for a temporary row repair. Soft PPR may be much faster
20+
than hard PPR, but the repair is lost with a power cycle.
21+
22+
The sysfs attribute nodes for a repair feature are only
23+
present if the parent driver has implemented the corresponding
24+
attr callback function and provided the necessary operations
25+
to the EDAC device driver during registration.
26+
27+
In some states of system configuration (e.g. before address
28+
decoders have been configured), memory devices (e.g. CXL)
29+
may not have an active mapping in the main host
30+
physical address map. As such, the memory to repair must be
31+
identified by a device specific physical addressing scheme
32+
using a device physical address (DPA). The DPA and other control
33+
attributes to use will be presented in related error records.
34+
35+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_type
36+
Date: March 2025
37+
KernelVersion: 6.15
38+
Contact: linux-edac@vger.kernel.org
39+
Description:
40+
(RO) Memory repair type, for example post package repair,
41+
memory sparing etc. Valid values are:
42+
43+
- ppr - Post package repair.
44+
45+
- cacheline-sparing
46+
47+
- row-sparing
48+
49+
- bank-sparing
50+
51+
- rank-sparing
52+
53+
- All other values are reserved.
54+
55+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
56+
Date: March 2025
57+
KernelVersion: 6.15
58+
Contact: linux-edac@vger.kernel.org
59+
Description:
60+
(RW) Get/Set the current persist repair mode for a
61+
repair function. Depending on the memory repair function, the
62+
device supports a repair that is either temporary, which is
63+
lost with a power cycle, or permanent. Valid values are:
64+
65+
- 0 - Soft memory repair (temporary repair).
66+
67+
- 1 - Hard memory repair (permanent repair).
68+
69+
- All other values are reserved.
70+
71+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
72+
Date: March 2025
73+
KernelVersion: 6.15
74+
Contact: linux-edac@vger.kernel.org
75+
Description:
76+
(RO) True if memory media is accessible and data is retained
77+
during the memory repair operation.
78+
Otherwise, data may not be retained and memory requests may not
79+
be correctly processed during a repair operation. In that case
80+
the repair operation cannot be executed at runtime and the memory
81+
must be taken offline first.
82+
83+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
84+
Date: March 2025
85+
KernelVersion: 6.15
86+
Contact: linux-edac@vger.kernel.org
87+
Description:
88+
(RW) Host Physical Address (HPA) of the memory to repair.
89+
The HPA to use will be provided in related error records.
90+
91+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
92+
Date: March 2025
93+
KernelVersion: 6.15
94+
Contact: linux-edac@vger.kernel.org
95+
Description:
96+
(RW) Device Physical Address (DPA) of the memory to repair.
97+
The specific DPA to use will be provided in related error
98+
records.
99+
100+
In some states of system configuration (e.g. before address
101+
decoders have been configured), memory devices (e.g. CXL)
102+
may not have an active mapping in the main host
103+
physical address map. As such, the memory to repair must be
104+
identified by a device specific physical addressing scheme
105+
using a DPA. The DPA to use will be
106+
presented in related error records.
107+
108+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
109+
Date: March 2025
110+
KernelVersion: 6.15
111+
Contact: linux-edac@vger.kernel.org
112+
Description:
113+
(RW) Nibble mask of the memory to repair.
114+
Nibble mask identifies one or more nibbles in error on the
115+
memory bus that produced the error event. Nibble Mask bit 0
116+
shall be set if nibble 0 on the memory bus produced the
117+
event, etc. For example, for CXL PPR and sparing, a nibble mask
118+
bit set to 1 indicates the request to perform repair
119+
operation in the specific device. All nibble mask bits set
120+
to 1 indicates the request to perform the operation in all
121+
devices. E.g. for CXL memory repair, the specific value of
122+
nibble mask to use will be provided in related error records.
123+
For more details, see the nibble mask field in CXL spec ver 3.1,
124+
section 8.2.9.7.1.2 Table 8-103 soft PPR and section
125+
8.2.9.7.1.3 Table 8-104 hard PPR, section 8.2.9.7.1.4
126+
Table 8-105 memory sparing.
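
The bit layout described above (bit n set when nibble n produced the event) can be sketched as a small helper. The function name `nibble_mask` is hypothetical, and the default width of 24 bits is an assumption based on the size of the CXL nibble mask field; in practice the value to write comes from the related error record rather than being built by hand.

```python
from collections.abc import Iterable


def nibble_mask(nibbles: Iterable[int], width: int = 24) -> int:
    """Return a mask with bit n set for every nibble index n in `nibbles`."""
    mask = 0
    for n in nibbles:
        if not 0 <= n < width:
            raise ValueError(f"nibble index {n} outside 0..{width - 1}")
        mask |= 1 << n
    return mask
```

For instance, `nibble_mask([0, 3])` sets bits 0 and 3, and passing every index in `range(24)` yields the all-ones mask that requests the operation in all devices.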
127+
128+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
129+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
130+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
131+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
132+
Date: March 2025
133+
KernelVersion: 6.15
134+
Contact: linux-edac@vger.kernel.org
135+
Description:
136+
(RW) The supported range of memory addresses that are to be
137+
repaired. The memory device may give the supported range of
138+
attributes to use and it will depend on the memory device
139+
and the portion of memory to repair.
140+
Userspace may receive the specific values of the attributes
141+
to use for a repair operation from the memory device via
142+
related error records and trace events, e.g. CXL DRAM
143+
and CXL general media error records in CXL memory devices.
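
A userspace tool might check an address taken from an error record against the advertised range before attempting a repair. The sketch below assumes the attributes read back as integers in either decimal or 0x-prefixed hex; `read_u64` and `dpa_in_range` are hypothetical helper names, not kernel or rasdaemon APIs.

```python
from pathlib import Path


def read_u64(directory: Path, name: str) -> int:
    """Read an integer sysfs attribute; base 0 accepts decimal and 0x-prefixed hex."""
    return int((directory / name).read_text().strip(), 0)


def dpa_in_range(repair_dir: Path, dpa: int) -> bool:
    """Return True if `dpa` lies within [min_dpa, max_dpa] of one mem_repairX directory."""
    return read_u64(repair_dir, "min_dpa") <= dpa <= read_u64(repair_dir, "max_dpa")
```

The same check applies to min_hpa and max_hpa when the repair is addressed by HPA.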
144+
145+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
146+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
147+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/rank
148+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/row
149+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/column
150+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/channel
151+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
152+
Date: March 2025
153+
KernelVersion: 6.15
154+
Contact: linux-edac@vger.kernel.org
155+
Description:
156+
(RW) The control attributes for the memory to be repaired.
157+
The specific value of attributes to use depends on the
158+
portion of memory to repair and will be reported to the host
159+
in related error records and be available to userspace
160+
in trace events, such as CXL DRAM and CXL general media
161+
error records of CXL memory devices.
162+
163+
When reading back these attributes, the current value of the
164+
memory requested to be repaired is returned.
165+
166+
bank_group - The bank group of the memory to repair.
167+
168+
bank - The bank number of the memory to repair.
169+
170+
rank - The rank of the memory to repair. Rank is defined as a
171+
set of memory devices on a channel that together execute a
172+
transaction.
173+
174+
row - The row number of the memory to repair.
175+
176+
column - The column number of the memory to repair.
177+
178+
channel - The channel of the memory to repair. Channel is
179+
defined as an interface that can be independently accessed
180+
for a transaction.
181+
182+
sub_channel - The subchannel of the memory to repair.
183+
184+
The requirement to set these attributes varies based on the
185+
repair function. The attributes in sysfs are not present
186+
unless required for a repair function.
187+
188+
For example, for the soft PPR (CXL spec ver 3.1, Section 8.2.9.7.1.2,
189+
Table 8-103) and hard PPR (Section 8.2.9.7.1.3, Table 8-104) operations,
190+
these attributes do not need to be set. For memory sparing (CXL spec
191+
ver 3.1, Section 8.2.9.7.1.4, Table 8-105), these attributes must be
192+
set according to the memory sparing granularity.
193+
194+
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair
195+
Date: March 2025
196+
KernelVersion: 6.15
197+
Contact: linux-edac@vger.kernel.org
198+
Description:
199+
(WO) Issue the memory repair operation for the specified
200+
memory repair attributes. The operation may fail if resources
201+
are insufficient based on the requirements of the memory
202+
device and repair function.
203+
204+
- 1 - Issue the repair operation.
205+
206+
- All other values are reserved.
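
Putting the mem_repairX entries together, a repair request could be issued along these lines. This is a sketch under stated assumptions: `issue_repair` is a hypothetical helper, repair_safe_when_in_use is assumed to read back as 0 or 1, and values are written as plain decimal text. The attribute values themselves would come from the related error record.

```python
from pathlib import Path


def issue_repair(repair_dir: Path, attrs: dict[str, int],
                 memory_online: bool = True) -> None:
    """Write the repair attributes for one mem_repairX directory, then trigger the repair."""
    safe_node = repair_dir / "repair_safe_when_in_use"
    if memory_online and safe_node.is_file():
        # Assumption: the attribute reads back as 0 (not safe) or 1 (safe).
        if int(safe_node.read_text().strip(), 0) == 0:
            raise RuntimeError("repair is not safe at runtime; offline the memory first")
    for name, value in attrs.items():
        node = repair_dir / name
        if not node.is_file():
            # Attributes are only present when the repair function requires them.
            raise FileNotFoundError(f"{name} is not exposed for this repair function")
        node.write_text(f"{value}\n")
    # Writing 1 issues the repair with the attributes set above.
    (repair_dir / "repair").write_text("1\n")
```

A soft PPR request by DPA might then look like `issue_repair(path, {"persist_mode": 0, "dpa": dpa, "nibble_mask": mask})`. The write to `repair` can still fail at the device level, for example when repair resources are exhausted, so callers should expect an `OSError`.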
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
1+
What: /sys/bus/edac/devices/<dev-name>/scrubX
2+
Date: March 2025
3+
KernelVersion: 6.15
4+
Contact: linux-edac@vger.kernel.org
5+
Description:
6+
The sysfs EDAC bus devices /<dev-name>/scrubX subdirectory
7+
belongs to an instance of memory scrub control feature,
8+
where <dev-name> directory corresponds to a device/memory
9+
region registered with the EDAC device driver for the
10+
scrub control feature.
11+
12+
The sysfs scrub attr nodes are only present if the parent
13+
driver has implemented the corresponding attr callback
14+
function and provided the necessary operations to the EDAC
15+
device driver during registration.
16+
17+
What: /sys/bus/edac/devices/<dev-name>/scrubX/addr
18+
Date: March 2025
19+
KernelVersion: 6.15
20+
Contact: linux-edac@vger.kernel.org
21+
Description:
22+
(RW) The base address of the memory region to be scrubbed
23+
for on-demand scrubbing. Setting address starts scrubbing.
24+
The size must be set before that.
25+
26+
The readback addr value is non-zero if the requested
27+
on-demand scrubbing is in progress, zero otherwise.
28+
29+
What: /sys/bus/edac/devices/<dev-name>/scrubX/size
30+
Date: March 2025
31+
KernelVersion: 6.15
32+
Contact: linux-edac@vger.kernel.org
33+
Description:
34+
(RW) The size of the memory region to be scrubbed
35+
(on-demand scrubbing).
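
The ordering constraint between size and addr can be captured in a small helper. This is a sketch only: `start_on_demand_scrub` and `scrub_in_progress` are hypothetical names, values are written as plain decimal text, and the readback is assumed to be an integer in decimal or 0x-prefixed hex.

```python
from pathlib import Path


def start_on_demand_scrub(scrub_dir: Path, addr: int, size: int) -> None:
    """Request an on-demand scrub of [addr, addr + size) via one scrubX directory."""
    # Size must be written first, because writing addr starts the scrub.
    (scrub_dir / "size").write_text(f"{size}\n")
    (scrub_dir / "addr").write_text(f"{addr}\n")


def scrub_in_progress(scrub_dir: Path) -> bool:
    """Return True while addr reads back non-zero, i.e. the scrub is still running."""
    return int((scrub_dir / "addr").read_text().strip(), 0) != 0
```

A caller would typically poll `scrub_in_progress` with a sleep between reads until it returns False.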
36+
37+
What: /sys/bus/edac/devices/<dev-name>/scrubX/enable_background
38+
Date: March 2025
39+
KernelVersion: 6.15
40+
Contact: linux-edac@vger.kernel.org
41+
Description:
42+
(RW) Start/Stop background (patrol) scrubbing if supported.
43+
44+
What: /sys/bus/edac/devices/<dev-name>/scrubX/min_cycle_duration
45+
Date: March 2025
46+
KernelVersion: 6.15
47+
Contact: linux-edac@vger.kernel.org
48+
Description:
49+
(RO) Minimum scrub cycle duration, in seconds, supported
50+
by the memory scrubber.
51+
52+
What: /sys/bus/edac/devices/<dev-name>/scrubX/max_cycle_duration
53+
Date: March 2025
54+
KernelVersion: 6.15
55+
Contact: linux-edac@vger.kernel.org
56+
Description:
57+
(RO) Maximum scrub cycle duration, in seconds, supported
58+
by the memory scrubber.
59+
60+
What: /sys/bus/edac/devices/<dev-name>/scrubX/current_cycle_duration
61+
Date: March 2025
62+
KernelVersion: 6.15
63+
Contact: linux-edac@vger.kernel.org
64+
Description:
65+
(RW) The current scrub cycle duration in seconds, which must be
66+
within the range supported by the memory scrubber.
67+
68+
Scrubbing has an overhead while it runs, which can be reduced
69+
by using a longer scrub cycle duration.
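
A background scrub could be tuned and enabled as sketched below. The helper names are hypothetical, the durations are assumed to read back as plain integers, and writing 1 or 0 to enable_background is an assumption, since the entry above does not list its accepted values.

```python
from pathlib import Path


def read_int(scrub_dir: Path, name: str) -> int:
    """Read an integer attribute from one scrubX directory."""
    return int((scrub_dir / name).read_text().strip(), 0)


def set_cycle_duration(scrub_dir: Path, seconds: int) -> int:
    """Clamp `seconds` to the supported range, write it, and return the value written."""
    lowest = read_int(scrub_dir, "min_cycle_duration")
    highest = read_int(scrub_dir, "max_cycle_duration")
    clamped = max(lowest, min(seconds, highest))
    (scrub_dir / "current_cycle_duration").write_text(f"{clamped}\n")
    return clamped


def set_background_scrub(scrub_dir: Path, enable: bool) -> None:
    """Start or stop background (patrol) scrubbing. Assumes 1/0 are accepted."""
    (scrub_dir / "enable_background").write_text("1\n" if enable else "0\n")
```

Clamping rather than failing is a design choice for this sketch; a stricter tool might reject out-of-range requests so the operator notices the mismatch.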
