Commit d021985

ebiggers authored and Ingo Molnar committed
x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs
Background:
===========

Currently kernel-mode FPU is not always usable in softirq context on
x86, since softirqs can nest inside a kernel-mode FPU section in task
context, and nested use of kernel-mode FPU is not supported. Therefore,
x86 SIMD-optimized code that can be called in softirq context has to
sometimes fall back to non-SIMD code. There are two options for the
fallback, both of which are pretty terrible:

 (a) Use a scalar fallback. This can be 10-100x slower than vectorized
     code because it cannot use specialized instructions like AES, SHA,
     or carryless multiplication.

 (b) Execute the request asynchronously using a kworker. In other
     words, use the "crypto SIMD helper" in crypto/simd.c.

Currently most of the x86 en/decryption code (skcipher and aead
algorithms) uses option (b), since this avoids the slow scalar fallback
and it is easier to wire up. But option (b) is still really bad for its
own reasons:

- Punting the request to a kworker is bad for performance too.

- It forces the algorithm to be marked as asynchronous
  (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API users
  who request a synchronous algorithm. That's another huge performance
  problem, which is especially unfortunate for users who don't even do
  en/decryption in softirq context.

- It makes all en/decryption operations take a detour through
  crypto/simd.c. That involves additional checks and an additional
  indirect call, which slow down en/decryption for *everyone*.

Fortunately, the skcipher and aead APIs are only usable in task and
softirq context in the first place. Thus, if kernel-mode FPU were to be
reliably usable in softirq context, no fallback would be needed.
Indeed, other architectures such as arm, arm64, and riscv have already
done this.

Changes implemented:
====================

Therefore, this patch updates x86 accordingly to reliably support
kernel-mode FPU in softirqs.

This is done by just disabling softirq processing in kernel-mode FPU
sections (when hardirqs are not already disabled), as that prevents the
nesting that was problematic.

This will delay some softirqs slightly, but only ones that would have
otherwise been nested inside a task context kernel-mode FPU section.
Any such softirqs would have taken the slow fallback path before if
they tried to do any en/decryption. Now these softirqs will just run at
the end of the task context kernel-mode FPU section (since
local_bh_enable() runs pending softirqs) and will no longer take the
slow fallback path.

Alternatives considered:
========================

- Make kernel-mode FPU sections fully preemptible. This would require
  growing task_struct by another struct fpstate which is more than 2K.

- Make softirqs save/restore the kernel-mode FPU state to a per-CPU
  struct fpstate when nested use is detected. Somewhat interesting, but
  seems unnecessary when a simpler solution exists.

Performance results:
====================

I did some benchmarks with AES-XTS encryption of 16-byte messages
(which is unrealistically small, but this makes it easier to see the
overhead of kernel-mode FPU...). The baseline was 384 MB/s. Removing
the use of crypto/simd.c, which this work makes possible, increases it
to 487 MB/s, a +27% improvement in throughput.

CPU was AMD Ryzen 9 9950X (Zen 5). No debugging options were enabled.

[ mingo: Prettified the changelog and added performance results. ]

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250304204954.3901-1-ebiggers@kernel.org
1 parent 69a2fdf commit d021985
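
To make the effect concrete, here is a minimal sketch (not part of the patch) of the calling pattern this change enables. my_encrypt(), my_encrypt_simd() and my_encrypt_scalar() are hypothetical names; kernel_fpu_begin()/kernel_fpu_end() and irq_fpu_usable() are the real <asm/fpu/api.h> interfaces:

#include <linux/types.h>
#include <asm/fpu/api.h>

void my_encrypt_simd(u8 *dst, const u8 *src, size_t len);   /* hypothetical SIMD path */
void my_encrypt_scalar(u8 *dst, const u8 *src, size_t len); /* hypothetical fallback */

void my_encrypt(u8 *dst, const u8 *src, size_t len)
{
	/*
	 * With this patch, task and softirq context can rely on kernel-mode
	 * FPU being usable, so skcipher/aead-style code no longer needs the
	 * crypto/simd.c async wrapper.  The irq_fpu_usable() check and the
	 * scalar fallback remain necessary only for unusual callers, e.g.
	 * code reachable from hardirq context.
	 */
	if (likely(irq_fpu_usable())) {
		kernel_fpu_begin();
		my_encrypt_simd(dst, src, len);
		kernel_fpu_end();
	} else {
		my_encrypt_scalar(dst, src, len);
	}
}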

2 files changed, +20 -14 lines changed

arch/x86/include/asm/fpu/api.h

Lines changed: 7 additions & 10 deletions
@@ -16,10 +16,9 @@
 
 /*
  * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
- * disables preemption so be careful if you intend to use it for long periods
- * of time.
- * If you intend to use the FPU in irq/softirq you need to check first with
- * irq_fpu_usable() if it is possible.
+ * disables preemption and softirq processing, so be careful if you intend to
+ * use it for long periods of time. Kernel-mode FPU cannot be used in all
+ * contexts -- see irq_fpu_usable() for details.
  */
 
 /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
@@ -50,10 +49,10 @@ static inline void kernel_fpu_begin(void)
 }
 
 /*
- * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate.
- * A context switch will (and softirq might) save CPU's FPU registers to
- * fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in
- * a random state.
+ * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate, or while
+ * using the FPU in kernel mode. A context switch will (and softirq might) save
+ * CPU's FPU registers to fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving
+ * CPU's FPU registers in a random state.
  *
  * local_bh_disable() protects against both preemption and soft interrupts
  * on !RT kernels.
@@ -63,8 +62,6 @@ static inline void kernel_fpu_begin(void)
  * preemptible. Disabling preemption is the right choice here as bottom
  * half processing is always in thread context on RT kernels so it
  * implicitly prevents bottom half processing as well.
- *
- * Disabling preemption also serializes against kernel_fpu_begin().
  */
 static inline void fpregs_lock(void)
 {
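
The comment above points at local_bh_disable() on !RT kernels and preemption disabling on RT kernels. For reference, here is a sketch of fpregs_lock()/fpregs_unlock() consistent with that comment (an illustration under those assumptions, not a verbatim copy of the header):

static inline void fpregs_lock(void)
{
	/*
	 * On !RT kernels, disabling bottom halves implies disabling
	 * preemption, so one call keeps both softirqs and context switches
	 * away from the FPU registers.  On RT kernels, bottom half
	 * processing always runs in thread context, so disabling preemption
	 * is already enough.
	 */
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		local_bh_disable();
	else
		preempt_disable();
}

static inline void fpregs_unlock(void)
{
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		local_bh_enable();
	else
		preempt_enable();
}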

arch/x86/kernel/fpu/core.c

Lines changed: 13 additions & 4 deletions
@@ -60,9 +60,16 @@ bool irq_fpu_usable(void)
 	if (WARN_ON_ONCE(in_nmi()))
 		return false;
 
-	/* In kernel FPU usage already active? */
-	if (this_cpu_read(in_kernel_fpu))
+	/*
+	 * In kernel FPU usage already active?  This detects any explicitly
+	 * nested usage in task or softirq context, which is unsupported.  It
+	 * also detects attempted usage in a hardirq that has interrupted a
+	 * kernel-mode FPU section.
+	 */
+	if (this_cpu_read(in_kernel_fpu)) {
+		WARN_ON_FPU(!in_hardirq());
 		return false;
+	}
 
 	/*
 	 * When not in NMI or hard interrupt context, FPU can be used in:
@@ -420,7 +427,8 @@ EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate);
 
 void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
-	preempt_disable();
+	if (!irqs_disabled())
+		fpregs_lock();
 
 	WARN_ON_FPU(!irq_fpu_usable());
 	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
@@ -448,7 +456,8 @@ void kernel_fpu_end(void)
 	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, false);
-	preempt_enable();
+	if (!irqs_disabled())
+		fpregs_unlock();
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);
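
Taken together, these hunks swap the bare preempt_disable()/preempt_enable() pair for fpregs_lock()/fpregs_unlock() whenever hardirqs are enabled; on !RT kernels that maps to local_bh_disable()/local_bh_enable(), which is what keeps softirqs from nesting inside the section. A hypothetical task-context timeline (encrypt_with_simd() is an illustrative name):

kernel_fpu_begin();	/* !irqs_disabled(): fpregs_lock() disables softirqs */
encrypt_with_simd();	/* a softirq raised here stays pending, never nests */
kernel_fpu_end();	/*
			 * fpregs_unlock() -> local_bh_enable() runs the
			 * pending softirq, which can now use kernel-mode FPU
			 * itself without a fallback
			 */

When hardirqs are already disabled, softirqs cannot run anyway, so the irqs_disabled() branch simply skips the bottom-half bookkeeping, matching the "when hardirqs are not already disabled" caveat in the changelog.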
