Commit 6f991cc
Eric DeVolder authored and akpm00 committed
crash: move a few code bits to setup support of crash hotplug
Patch series "crash: Kernel handling of CPU and memory hot un/plug", v28.

Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also be
updated.

The elfcorehdr describes to kdump the CPUs and memory in the system, and
any inaccuracies can result in a vmcore with missing CPU context or memory
regions.

The current solution utilizes udev to initiate an unload-then-reload of
the kdump image (e.g. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility. In the original post I
outlined the significant performance problems related to offloading this
activity to userspace.

This patchset introduces a generic crash handler that registers with the
CPU and memory notifiers. Upon CPU or memory changes, from either hot
un/plug or off/onlining, this generic handler is invoked and performs
important housekeeping, for example obtaining the appropriate lock, and
then invokes an architecture specific handler to do the appropriate
elfcorehdr update.

Note the descriptions in patches 'crash: change
crash_prepare_elf64_headers() to for_each_possible_cpu()' and 'x86/crash:
optimize CPU changes' that enable further optimizations related to CPU
plug/unplug/online/offline performance of elfcorehdr updates.

In the case of x86_64, the arch specific handler generates a new
elfcorehdr, and overwrites the old one in memory; thus no involvement
with userspace is needed.

To realize the benefits/test this patchset, one must make a couple
of minor changes to userspace:

 - Prevent udev from updating kdump crash kernel on hot un/plug changes.
   Add the following as the first lines to the RHEL udev rule file
   /usr/lib/udev/rules.d/98-kexec.rules:

   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

   With this changeset applied, the two rules evaluate to false for
   CPU and memory change events and thus skip the userspace
   unload-then-reload of kdump.

 - Change to the kexec_file_load for loading the kdump kernel:
   E.g. on RHEL: in /usr/bin/kdumpctl, change to:
    standard_kexec_args="-p -d -s"
   which adds the -s to select kexec_file_load() syscall.

This kernel patchset also supports kexec_load() with a modified kexec
userspace utility. A working changeset to the kexec userspace utility is
posted to the kexec-tools mailing list here:

 http://lists.infradead.org/pipermail/kexec/2023-May/027049.html

To use the kexec-tools patch, apply, build and install kexec-tools, then
change the kdumpctl's standard_kexec_args to replace the -s with
--hotplug. The removal of -s reverts to the kexec_load syscall and the
addition of --hotplug invokes the changes put forth in the kexec-tools
patch.

This patch (of 8):

The crash hotplug support leans on the work for the kexec_file_load()
syscall. To also support the kexec_load() syscall, a few bits of code
need to be moved outside of CONFIG_KEXEC_FILE. As such, these bits are
moved out of kexec_file.c and into a common location crash_core.c.

In addition, struct crash_mem and crash_notes were moved to new locales
so that PROC_KCORE, which sets CRASH_CORE alone, builds correctly.

No functionality change intended.

Link: https://lkml.kernel.org/r/20230814214446.6659-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230814214446.6659-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
Reviewed-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Acked-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Akhil Raj <lf32.dev@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 3d0b713 commit 6f991cc

5 files changed, +238 -233 lines changed

include/linux/crash_core.h

Lines changed: 20 additions & 0 deletions
@@ -28,6 +28,8 @@
 			    VMCOREINFO_BYTES)

 typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
+/* Per cpu memory for storing cpu states in case of system crash. */
+extern note_buf_t __percpu *crash_notes;

 void crash_update_vmcoreinfo_safecopy(void *ptr);
 void crash_save_vmcoreinfo(void);
@@ -84,4 +86,22 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
 int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);

+/* Alignment required for elf header segment */
+#define ELF_CORE_HEADER_ALIGN 4096
+
+struct crash_mem {
+	unsigned int max_nr_ranges;
+	unsigned int nr_ranges;
+	struct range ranges[];
+};
+
+extern int crash_exclude_mem_range(struct crash_mem *mem,
+				   unsigned long long mstart,
+				   unsigned long long mend);
+extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
+				       void **addr, unsigned long *sz);
+
+struct kimage;
+struct kexec_segment;
+
 #endif /* LINUX_CRASH_CORE_H */

include/linux/kexec.h

Lines changed: 0 additions & 15 deletions
@@ -230,21 +230,6 @@ static inline int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
 }
 #endif

-/* Alignment required for elf header segment */
-#define ELF_CORE_HEADER_ALIGN 4096
-
-struct crash_mem {
-	unsigned int max_nr_ranges;
-	unsigned int nr_ranges;
-	struct range ranges[];
-};
-
-extern int crash_exclude_mem_range(struct crash_mem *mem,
-				   unsigned long long mstart,
-				   unsigned long long mend);
-extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
-				       void **addr, unsigned long *sz);
-
 #ifndef arch_kexec_apply_relocations_add
 /*
  * arch_kexec_apply_relocations_add - apply relocations of type RELA

kernel/crash_core.c

Lines changed: 218 additions & 0 deletions
@@ -10,6 +10,7 @@
 #include <linux/utsname.h>
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
+#include <linux/kexec.h>

 #include <asm/page.h>
 #include <asm/sections.h>
@@ -18,6 +19,9 @@

 #include "kallsyms_internal.h"

+/* Per cpu memory for storing cpu states in case of system crash. */
+note_buf_t __percpu *crash_notes;
+
 /* vmcoreinfo stuff */
 unsigned char *vmcoreinfo_data;
 size_t vmcoreinfo_size;
@@ -314,6 +318,187 @@ static int __init parse_crashkernel_dummy(char *arg)
 }
 early_param("crashkernel", parse_crashkernel_dummy);

+int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
+			  void **addr, unsigned long *sz)
+{
+	Elf64_Ehdr *ehdr;
+	Elf64_Phdr *phdr;
+	unsigned long nr_cpus = num_possible_cpus(), nr_phdr, elf_sz;
+	unsigned char *buf;
+	unsigned int cpu, i;
+	unsigned long long notes_addr;
+	unsigned long mstart, mend;
+
+	/* extra phdr for vmcoreinfo ELF note */
+	nr_phdr = nr_cpus + 1;
+	nr_phdr += mem->nr_ranges;
+
+	/*
+	 * kexec-tools creates an extra PT_LOAD phdr for kernel text mapping
+	 * area (for example, ffffffff80000000 - ffffffffa0000000 on x86_64).
+	 * I think this is required by tools like gdb. So same physical
+	 * memory will be mapped in two ELF headers. One will contain kernel
+	 * text virtual addresses and other will have __va(physical) addresses.
+	 */
+
+	nr_phdr++;
+	elf_sz = sizeof(Elf64_Ehdr) + nr_phdr * sizeof(Elf64_Phdr);
+	elf_sz = ALIGN(elf_sz, ELF_CORE_HEADER_ALIGN);
+
+	buf = vzalloc(elf_sz);
+	if (!buf)
+		return -ENOMEM;
+
+	ehdr = (Elf64_Ehdr *)buf;
+	phdr = (Elf64_Phdr *)(ehdr + 1);
+	memcpy(ehdr->e_ident, ELFMAG, SELFMAG);
+	ehdr->e_ident[EI_CLASS] = ELFCLASS64;
+	ehdr->e_ident[EI_DATA] = ELFDATA2LSB;
+	ehdr->e_ident[EI_VERSION] = EV_CURRENT;
+	ehdr->e_ident[EI_OSABI] = ELF_OSABI;
+	memset(ehdr->e_ident + EI_PAD, 0, EI_NIDENT - EI_PAD);
+	ehdr->e_type = ET_CORE;
+	ehdr->e_machine = ELF_ARCH;
+	ehdr->e_version = EV_CURRENT;
+	ehdr->e_phoff = sizeof(Elf64_Ehdr);
+	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
+	ehdr->e_phentsize = sizeof(Elf64_Phdr);
+
+	/* Prepare one phdr of type PT_NOTE for each present CPU */
+	for_each_present_cpu(cpu) {
+		phdr->p_type = PT_NOTE;
+		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
+		phdr->p_offset = phdr->p_paddr = notes_addr;
+		phdr->p_filesz = phdr->p_memsz = sizeof(note_buf_t);
+		(ehdr->e_phnum)++;
+		phdr++;
+	}
+
+	/* Prepare one PT_NOTE header for vmcoreinfo */
+	phdr->p_type = PT_NOTE;
+	phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
+	phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
+	(ehdr->e_phnum)++;
+	phdr++;
+
+	/* Prepare PT_LOAD type program header for kernel text region */
+	if (need_kernel_map) {
+		phdr->p_type = PT_LOAD;
+		phdr->p_flags = PF_R|PF_W|PF_X;
+		phdr->p_vaddr = (unsigned long) _text;
+		phdr->p_filesz = phdr->p_memsz = _end - _text;
+		phdr->p_offset = phdr->p_paddr = __pa_symbol(_text);
+		ehdr->e_phnum++;
+		phdr++;
+	}
+
+	/* Go through all the ranges in mem->ranges[] and prepare phdr */
+	for (i = 0; i < mem->nr_ranges; i++) {
+		mstart = mem->ranges[i].start;
+		mend = mem->ranges[i].end;
+
+		phdr->p_type = PT_LOAD;
+		phdr->p_flags = PF_R|PF_W|PF_X;
+		phdr->p_offset = mstart;
+
+		phdr->p_paddr = mstart;
+		phdr->p_vaddr = (unsigned long) __va(mstart);
+		phdr->p_filesz = phdr->p_memsz = mend - mstart + 1;
+		phdr->p_align = 0;
+		ehdr->e_phnum++;
+		pr_debug("Crash PT_LOAD ELF header. phdr=%p vaddr=0x%llx, paddr=0x%llx, sz=0x%llx e_phnum=%d p_offset=0x%llx\n",
+			phdr, phdr->p_vaddr, phdr->p_paddr, phdr->p_filesz,
+			ehdr->e_phnum, phdr->p_offset);
+		phdr++;
+	}
+
+	*addr = buf;
+	*sz = elf_sz;
+	return 0;
+}
+
+int crash_exclude_mem_range(struct crash_mem *mem,
+			    unsigned long long mstart, unsigned long long mend)
+{
+	int i, j;
+	unsigned long long start, end, p_start, p_end;
+	struct range temp_range = {0, 0};
+
+	for (i = 0; i < mem->nr_ranges; i++) {
+		start = mem->ranges[i].start;
+		end = mem->ranges[i].end;
+		p_start = mstart;
+		p_end = mend;
+
+		if (mstart > end || mend < start)
+			continue;
+
+		/* Truncate any area outside of range */
+		if (mstart < start)
+			p_start = start;
+		if (mend > end)
+			p_end = end;
+
+		/* Found completely overlapping range */
+		if (p_start == start && p_end == end) {
+			mem->ranges[i].start = 0;
+			mem->ranges[i].end = 0;
+			if (i < mem->nr_ranges - 1) {
+				/* Shift rest of the ranges to left */
+				for (j = i; j < mem->nr_ranges - 1; j++) {
+					mem->ranges[j].start =
+						mem->ranges[j+1].start;
+					mem->ranges[j].end =
+						mem->ranges[j+1].end;
+				}
+
+				/*
+				 * Continue to check if there are another overlapping ranges
+				 * from the current position because of shifting the above
+				 * mem ranges.
+				 */
+				i--;
+				mem->nr_ranges--;
+				continue;
+			}
+			mem->nr_ranges--;
+			return 0;
+		}
+
+		if (p_start > start && p_end < end) {
+			/* Split original range */
+			mem->ranges[i].end = p_start - 1;
+			temp_range.start = p_end + 1;
+			temp_range.end = end;
+		} else if (p_start != start)
+			mem->ranges[i].end = p_start - 1;
+		else
+			mem->ranges[i].start = p_end + 1;
+		break;
+	}
+
+	/* If a split happened, add the split to array */
+	if (!temp_range.end)
+		return 0;
+
+	/* Split happened */
+	if (i == mem->max_nr_ranges - 1)
+		return -ENOMEM;
+
+	/* Location where new range should go */
+	j = i + 1;
+	if (j < mem->nr_ranges) {
+		/* Move over all ranges one slot towards the end */
+		for (i = mem->nr_ranges - 1; i >= j; i--)
+			mem->ranges[i + 1] = mem->ranges[i];
+	}
+
+	mem->ranges[j].start = temp_range.start;
+	mem->ranges[j].end = temp_range.end;
+	mem->nr_ranges++;
+	return 0;
+}
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -515,3 +700,36 @@ static int __init crash_save_vmcoreinfo_init(void)
 }

 subsys_initcall(crash_save_vmcoreinfo_init);
+
+static int __init crash_notes_memory_init(void)
+{
+	/* Allocate memory for saving cpu registers. */
+	size_t size, align;
+
+	/*
+	 * crash_notes could be allocated across 2 vmalloc pages when percpu
+	 * is vmalloc based . vmalloc doesn't guarantee 2 continuous vmalloc
+	 * pages are also on 2 continuous physical pages. In this case the
+	 * 2nd part of crash_notes in 2nd page could be lost since only the
+	 * starting address and size of crash_notes are exported through sysfs.
+	 * Here round up the size of crash_notes to the nearest power of two
+	 * and pass it to __alloc_percpu as align value. This can make sure
+	 * crash_notes is allocated inside one physical page.
+	 */
+	size = sizeof(note_buf_t);
+	align = min(roundup_pow_of_two(sizeof(note_buf_t)), PAGE_SIZE);
+
+	/*
+	 * Break compile if size is bigger than PAGE_SIZE since crash_notes
+	 * definitely will be in 2 pages with that.
+	 */
+	BUILD_BUG_ON(size > PAGE_SIZE);
+
+	crash_notes = __alloc_percpu(size, align);
+	if (!crash_notes) {
+		pr_warn("Memory allocation for saving cpu register states failed\n");
+		return -ENOMEM;
+	}
+	return 0;
+}
+subsys_initcall(crash_notes_memory_init);

kernel/kexec_core.c

Lines changed: 0 additions & 37 deletions
@@ -49,9 +49,6 @@

 atomic_t __kexec_lock = ATOMIC_INIT(0);

-/* Per cpu memory for storing cpu states in case of system crash. */
-note_buf_t __percpu *crash_notes;
-
 /* Flag to indicate we are going to kexec a new kernel */
 bool kexec_in_progress = false;

@@ -1218,40 +1215,6 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
 	final_note(buf);
 }

-static int __init crash_notes_memory_init(void)
-{
-	/* Allocate memory for saving cpu registers. */
-	size_t size, align;
-
-	/*
-	 * crash_notes could be allocated across 2 vmalloc pages when percpu
-	 * is vmalloc based . vmalloc doesn't guarantee 2 continuous vmalloc
-	 * pages are also on 2 continuous physical pages. In this case the
-	 * 2nd part of crash_notes in 2nd page could be lost since only the
-	 * starting address and size of crash_notes are exported through sysfs.
-	 * Here round up the size of crash_notes to the nearest power of two
-	 * and pass it to __alloc_percpu as align value. This can make sure
-	 * crash_notes is allocated inside one physical page.
-	 */
-	size = sizeof(note_buf_t);
-	align = min(roundup_pow_of_two(sizeof(note_buf_t)), PAGE_SIZE);
-
-	/*
-	 * Break compile if size is bigger than PAGE_SIZE since crash_notes
-	 * definitely will be in 2 pages with that.
-	 */
-	BUILD_BUG_ON(size > PAGE_SIZE);
-
-	crash_notes = __alloc_percpu(size, align);
-	if (!crash_notes) {
-		pr_warn("Memory allocation for saving cpu register states failed\n");
-		return -ENOMEM;
-	}
-	return 0;
-}
-subsys_initcall(crash_notes_memory_init);
-
-
 /*
  * Move into place and start executing a preloaded standalone
  * executable. If nothing was preloaded return an error.
