Commit 6f15e61

yghannam authored and bp3tk0v committed
RAS: Introduce a FRU memory poison manager

Memory errors are an expected occurrence on systems with high memory density. Generally, errors within a small number of unique physical locations are acceptable, based on manufacturer and/or admin policy. During run time, memory with errors may be retired so it is no longer used by the system. This is done in mm through page poisoning, and the effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time error handling may occur in the next reboot cycle, leading to terminating jobs due to that already known bad memory. This could be prevented if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent storage. Their driver saves memory error information to the persistent storage during run time. The information is then restored after reset, and known bad memory will be retired before the hardware is used. A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should leverage industry-standard components as much as possible, rather than a bespoke platform driver. Two components are needed: a record format and a persistent storage interface.

Implement a new module to manage the record formats on persistent storage. Use the requirements for an AMD MI300-based system to start. Vendor- and platform-specific details can be abstracted later as needed.

  [ bp: Massage commit message and code, squash 30-ish more fixes from Yazen and me. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: <naveenkrishna.chatradhi@amd.com>
Signed-off-by: <naveenkrishna.chatradhi@amd.com>
Co-developed-by: <muralidhara.mk@amd.com>
Signed-off-by: <muralidhara.mk@amd.com>
Tested-by: <sathyapriya.k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
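
The save-and-restore flow described in the commit message could look roughly like the user-space sketch below. It is only an illustration of the idea (a record format plus a persistent storage interface), not the code in drivers/ras/amd/fmpm.c. Every name here (`poison_record`, `record_add`, `storage_write`, `record_restore`, `MAX_BAD_ADDRS`) is hypothetical, the record layout is not the UEFI CPER "FRU Memory Poison" section format, and a plain byte buffer stands in for ACPI ERST.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BAD_ADDRS 8	/* hypothetical per-FRU policy limit */

/* Hypothetical record layout; NOT the CPER "FRU Memory Poison" format. */
struct poison_record {
	uint32_t fru_id;
	uint32_t nr_entries;
	uint64_t addrs[MAX_BAD_ADDRS];
	uint32_t checksum;	/* 32-bit sum over the fields above */
};

/* Sum the header fields and both 32-bit halves of each used address. */
uint32_t record_checksum(const struct poison_record *rec)
{
	uint32_t sum = rec->fru_id + rec->nr_entries;

	for (uint32_t i = 0; i < rec->nr_entries; i++)
		sum += (uint32_t)rec->addrs[i] + (uint32_t)(rec->addrs[i] >> 32);
	return sum;
}

/* Start an empty record for one FRU. */
void record_init(struct poison_record *rec, uint32_t fru_id)
{
	memset(rec, 0, sizeof(*rec));
	rec->fru_id = fru_id;
	rec->checksum = record_checksum(rec);
}

/*
 * Run time: log a newly found bad address. Duplicates are ignored.
 * Returns false when the record is full (policy limit reached).
 */
bool record_add(struct poison_record *rec, uint64_t addr)
{
	for (uint32_t i = 0; i < rec->nr_entries; i++)
		if (rec->addrs[i] == addr)
			return true;

	if (rec->nr_entries >= MAX_BAD_ADDRS)
		return false;

	rec->addrs[rec->nr_entries++] = addr;
	rec->checksum = record_checksum(rec);
	return true;
}

/* Stand-in for the persistent storage interface: copy into a buffer. */
void storage_write(uint8_t *storage, const struct poison_record *rec)
{
	memcpy(storage, rec, sizeof(*rec));
}

/*
 * Boot time: reload the record, validate it, and hand every known bad
 * address to retire() (may be NULL). Returns the number of addresses
 * restored, or -1 if the stored record is corrupt.
 */
int record_restore(const uint8_t *storage, struct poison_record *rec,
		   void (*retire)(uint64_t addr))
{
	memcpy(rec, storage, sizeof(*rec));

	if (rec->nr_entries > MAX_BAD_ADDRS ||
	    rec->checksum != record_checksum(rec)) {
		memset(rec, 0, sizeof(*rec));
		return -1;
	}

	for (uint32_t i = 0; i < rec->nr_entries; i++)
		if (retire)
			retire(rec->addrs[i]);

	return (int)rec->nr_entries;
}
```

A caller would invoke `record_add()` followed by `storage_write()` whenever an error is handled at run time, and `record_restore()` once early in the next boot so that known bad memory is retired before it is used again. The fixed-size array mirrors the idea that only a small number of unique bad locations is acceptable under policy.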
1 parent 3b566b3 commit 6f15e61

4 files changed, +831 -0 lines changed

MAINTAINERS

Lines changed: 6 additions & 0 deletions
@@ -18363,6 +18363,12 @@ F:	drivers/ras/
 F:	include/linux/ras.h
 F:	include/ras/ras_event.h
 
+RAS FRU MEMORY POISON MANAGER (FMPM)
+M:	Yazen Ghannam <Yazen.Ghannam@amd.com>
+L:	linux-edac@vger.kernel.org
+S:	Maintained
+F:	drivers/ras/amd/fmpm.c
+
 RC-CORE / LIRC FRAMEWORK
 M:	Sean Young <sean@mess.org>
 L:	linux-media@vger.kernel.org

drivers/ras/Kconfig

Lines changed: 12 additions & 0 deletions
@@ -34,4 +34,16 @@ if RAS
 source "arch/x86/ras/Kconfig"
 source "drivers/ras/amd/atl/Kconfig"
 
+config RAS_FMPM
+	tristate "FRU Memory Poison Manager"
+	default m
+	depends on AMD_ATL && ACPI_APEI
+	help
+	  Support saving and restoring memory error information across reboot
+	  using ACPI ERST as persistent storage. Error information is saved with
+	  the UEFI CPER "FRU Memory Poison" section format.
+
+	  Memory will be retired during boot time and run time depending on
+	  platform-specific policies.
+
 endif

drivers/ras/Makefile

Lines changed: 1 addition & 0 deletions
@@ -3,4 +3,5 @@ obj-$(CONFIG_RAS) += ras.o
 obj-$(CONFIG_DEBUG_FS) += debugfs.o
 obj-$(CONFIG_RAS_CEC) += cec.o
 
+obj-$(CONFIG_RAS_FMPM) += amd/fmpm.o
 obj-y += amd/atl/
