IRQ locking during exceptions isn't consistent

**Describe the bug**
We should have well-defined behavior for whether interrupts are enabled on the current CPU when servicing an exception, like a fatal error, page fault, MPU fault, etc.

* On x86 32-bit, they are unlocked if and only if the faulting context also had interrupts unlocked.
* On x86-64 and native_posix they are unlocked unless the fault is from an ARCH_EXCEPT() path (oops, panic. etc) -- this is certainly a bug
* On riscv32/64, xtensa, and Nios II, they are locked.
* On everything else (ARM, ARC) they are unlocked.

The 32-bit x86 case is the most interesting, it has a check:
```
	/*
	 * restore interrupt enable state, then call the handler
	 *
	 * interrupts are enabled only if they were allowed at the time
	 * the exception was triggered -- this protects kernel level code
	 * that mustn't be interrupted
	 *
	 * Test IF bit of saved EFLAGS and re-enable interrupts if IF=1.
	 */

	/* ESP is still pointing to the ESF at this point */

	testl	$0x200, __z_arch_esf_t_eflags_OFFSET(%esp)
	je	allDone
	sti

allDone:
```

**To Reproduce**
Apply this patch to tests/kernel/fatal and run it:

```
diff --git a/tests/kernel/fatal/src/main.c b/tests/kernel/fatal/src/main.c
index 14ae937a05..acfd114808 100644
--- a/tests/kernel/fatal/src/main.c
+++ b/tests/kernel/fatal/src/main.c
@@ -46,8 +46,17 @@ static ZTEST_DMEM volatile int expected_reason = -1;
 
 void k_sys_fatal_error_handler(unsigned int reason, const z_arch_esf_t *pEsf)
 {
+       int key = irq_lock();
+
+       irq_unlock(key);
+
        TC_PRINT("Caught system error -- reason %d\n", reason);
 
+       if (!arch_irq_unlocked(key)) {
+               printk("irqs locked during exception handling\n");
+               k_fatal_halt(reason);
+       }
+
        if (expected_reason == -1) {
                printk("Was not expecting a crash\n");
                k_fatal_halt(reason);
```

**Expected behavior**
The expected behavior seems to be currently undefined. I certainly think we do not want them unconditionally locked.

The question is, should they be unlocked all the time, or conditional on whether the faulting context was locked?

I have some doubts on whether the x86-32 policy is really protecting anything, but I need to think about it some more.

Any fix for this should include the equivalent of the above test reflecting the policy and update our fusa requirements database.

**Impact**
System latency servicing interrupts when handling an exception. This is a minor issue if all exceptions are fatal errors that are not expected to occur. But if we get into a realm where exceptions are recoverable (FPU management, page faults, etc) and part of normal kernel operation, this probably breaks our real-time guarantees. We already have one-use case for MMU/MPU fault handling with arch_user_string_nlen()


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

IRQ locking during exceptions isn't consistent #21923

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

IRQ locking during exceptions isn't consistent #21923

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions