Skip to content

IRQ locking during exceptions isn't consistent #21923

@andrewboie

Description

@andrewboie

Describe the bug
We should have well-defined behavior for whether interrupts are enabled on the current CPU when servicing an exception, like a fatal error, page fault, MPU fault, etc.

  • On x86 32-bit, they are unlocked if and only if the faulting context also had interrupts unlocked.
  • On x86-64 and native_posix they are unlocked unless the fault is from an ARCH_EXCEPT() path (oops, panic. etc) -- this is certainly a bug
  • On riscv32/64, xtensa, and Nios II, they are locked.
  • On everything else (ARM, ARC) they are unlocked.

The 32-bit x86 case is the most interesting, it has a check:

	/*
	 * restore interrupt enable state, then call the handler
	 *
	 * interrupts are enabled only if they were allowed at the time
	 * the exception was triggered -- this protects kernel level code
	 * that mustn't be interrupted
	 *
	 * Test IF bit of saved EFLAGS and re-enable interrupts if IF=1.
	 */

	/* ESP is still pointing to the ESF at this point */

	testl	$0x200, __z_arch_esf_t_eflags_OFFSET(%esp)
	je	allDone
	sti

allDone:

To Reproduce
Apply this patch to tests/kernel/fatal and run it:

diff --git a/tests/kernel/fatal/src/main.c b/tests/kernel/fatal/src/main.c
index 14ae937a05..acfd114808 100644
--- a/tests/kernel/fatal/src/main.c
+++ b/tests/kernel/fatal/src/main.c
@@ -46,8 +46,17 @@ static ZTEST_DMEM volatile int expected_reason = -1;
 
 void k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t *pEsf)
 {
+       int key = irq_lock();
+
+       irq_unlock(key);
+
        TC_PRINT("Caught system error -- reason %d\n", reason);
 
+       if (!arch_irq_unlocked(key)) {
+               printk("irqs locked during exception handling\n");
+               k_fatal_halt(reason);
+       }
+
        if (expected_reason == -1) {
                printk("Was not expecting a crash\n");
                k_fatal_halt(reason);

Expected behavior
The expected behavior seems to be currently undefined. I certainly think we do not want them unconditionally locked.

The question is, should they be unlocked all the time, or conditional on whether the faulting context was locked?

I have some doubts on whether the x86-32 policy is really protecting anything, but I need to think about it some more.

Any fix for this should include the equivalent of the above test reflecting the policy and update our fusa requirements database.

Impact
System latency servicing interrupts when handling an exception. This is a minor issue if all exceptions are fatal errors that are not expected to occur. But if we get into a realm where exceptions are recoverable (FPU management, page faults, etc) and part of normal kernel operation, this probably breaks our real-time guarantees. We already have one-use case for MMU/MPU fault handling with arch_user_string_nlen()

Metadata

Metadata

Labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions