Commit ac4ad5c

Liao Chang authored and ctmarinas committed
arm64: insn: Simulate nop instruction for better uprobe performance
v2->v1:
1. Remove the simulation of STP and the related bits.
2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI scenario.

As Andrii pointed out, the uprobe/uretprobe selftest bench runs into a counterintuitive result: the nop and push variants are much slower than the ret variant [0]. The root cause lies in arch_probe_analyse_insn(), which excludes 'nop' and 'stp' from the list of emulatable instructions. This forces the kernel to return to userspace and execute them out-of-line, then trap back into the kernel to run the uprobe callback functions. This leads to a significant performance overhead compared to the 'ret' variant, which is already emulated.

Typically, a uprobe is installed on 'nop' for USDT and on function entry, which starts with the instruction 'stp x29, x30, [sp, #imm]!' to push lr and fp onto the stack, regardless of whether the binary is kernel or userspace. In order to improve the performance of handling uprobes for common use cases, this patch supports the emulation of the Arm64 equivalents of the 'nop' and 'push' instructions. The benchmark results below indicate that the performance gain of emulation is obvious.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.

xol (1 cpus)
------------
uprobe-nop:      0.916 ± 0.001M/s (0.916M/prod)
uprobe-push:     0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:      1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:   0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push:  0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:   0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:      1.862 ± 0.002M/s (1.862M/prod)
uprobe-push:     1.743 ± 0.006M/s (1.743M/prod)
uprobe-ret:      1.840 ± 0.001M/s (1.840M/prod)
uretprobe-nop:   0.964 ± 0.004M/s (0.964M/prod)
uretprobe-push:  0.936 ± 0.004M/s (0.936M/prod)
uretprobe-ret:   0.940 ± 0.001M/s (0.940M/prod)

As shown above, the performance gap between the 'nop/push' and 'ret' variants has been significantly reduced. Because emulating the 'push' instruction requires accessing userspace memory, it takes more cycles than the others.
As Mark suggested [1], it is painful to emulate the correct atomicity and ordering properties of STP, especially when it interacts with MTE, POE, etc. So this patch focuses only on the simulation of 'nop'. The simulation of STP and related changes will be addressed in a separate patch.

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/

CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
CC: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20240909071114.1150053-1-liaochang1@huawei.com
[catalin.marinas@arm.com: small tweaks following MarkR's comments]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
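To put the benchmark figures for the 'nop' variants in perspective, the throughput ratio between the xol and emulation runs can be worked out directly. The helper below is only an illustrative sketch; the name `speedup` is hypothetical and not part of the patch.

```c
#include <assert.h>

/*
 * Ratio of two throughput figures given in the same unit (M ops/s).
 * Hypothetical helper, used only to illustrate the arithmetic.
 */
double speedup(double before, double after)
{
	assert(before > 0.0);
	return after / before;
}
```

With the numbers quoted in the commit message, `speedup(0.916, 1.862)` is roughly 2.03 for uprobe-nop and `speedup(0.640, 0.964)` is roughly 1.51 for uretprobe-nop.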
1 parent 1476210 commit ac4ad5c

4 files changed: +21 -0 lines changed
arch/arm64/include/asm/insn.h

Lines changed: 5 additions & 0 deletions
@@ -575,6 +575,11 @@ static __always_inline u32 aarch64_insn_gen_nop(void)
 	return aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
 }
 
+static __always_inline bool aarch64_insn_is_nop(u32 insn)
+{
+	return insn == aarch64_insn_gen_nop();
+}
+
 u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg,
 				enum aarch64_insn_branch_type type);
 u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg,
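
The new helper compares the probed instruction word against the generated NOP encoding. A minimal userspace sketch of that comparison follows, assuming the standard A64 encodings (NOP is `0xd503201f`, with the hint number held in bits [11:5]). The names `gen_hint` and `insn_is_nop` are stand-ins for the kernel helpers, not the kernel code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* A64 HINT base encoding; the hint number sits in bits [11:5]. */
#define HINT_BASE   0xd503201fu
#define HINT_NOP    (0u << 5)
#define HINT_YIELD  (1u << 5)

/* Stand-in for aarch64_insn_gen_hint(): OR the hint number into the base. */
uint32_t gen_hint(uint32_t op)
{
	return HINT_BASE | op;
}

/* Stand-in for aarch64_insn_is_nop(): exact match against the NOP word. */
bool insn_is_nop(uint32_t insn)
{
	return insn == gen_hint(HINT_NOP);
}
```

Because the check is an exact equality, other hint instructions such as YIELD (`0xd503203f`) do not match and continue down the existing decode path.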

arch/arm64/kernel/probes/decode-insn.c

Lines changed: 9 additions & 0 deletions
@@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
 enum probe_insn __kprobes
 arm_probe_decode_insn(u32 insn, struct arch_probe_insn *api)
 {
+	/*
+	 * While 'nop' instruction can execute in the out-of-line slot,
+	 * simulating them in breakpoint handling offers better performance.
+	 */
+	if (aarch64_insn_is_nop(insn)) {
+		api->handler = simulate_nop;
+		return INSN_GOOD_NO_SLOT;
+	}
+
 	/*
 	 * Instructions reading or modifying the PC won't work from the XOL
 	 * slot.
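
The decode change can be read as an early exit: a NOP gets a simulation handler and needs no out-of-line (XOL) slot, while everything else continues through the existing checks. The sketch below models only that early exit in userspace. All `*_model` names are hypothetical, and the fall-through to `INSN_GOOD` is a deliberate simplification (the real function can also reject instructions or select other handlers).

```c
#include <stddef.h>
#include <stdint.h>

enum probe_insn { INSN_REJECTED, INSN_GOOD_NO_SLOT, INSN_GOOD };

struct regs_model { uint64_t pc; };

typedef void (*probe_handler_t)(uint32_t opcode, long addr,
				struct regs_model *regs);

struct probe_insn_model { probe_handler_t handler; };

/* Simulating a NOP has no effect other than stepping past it. */
void simulate_nop_model(uint32_t opcode, long addr, struct regs_model *regs)
{
	(void)opcode;
	(void)addr;
	regs->pc += 4;
}

enum probe_insn decode_model(uint32_t insn, struct probe_insn_model *api)
{
	api->handler = NULL;

	/* NOP: simulate in the breakpoint handler, no XOL slot needed. */
	if (insn == 0xd503201fu) {
		api->handler = simulate_nop_model;
		return INSN_GOOD_NO_SLOT;
	}

	/* Simplification: treat every other instruction as XOL-steppable. */
	return INSN_GOOD;
}
```

In this model, a caller that receives `INSN_GOOD_NO_SLOT` invokes `api->handler` directly instead of returning to userspace to single-step a copied instruction, which is the round trip the commit message identifies as the overhead.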

arch/arm64/kernel/probes/simulate-insn.c

Lines changed: 6 additions & 0 deletions
@@ -196,3 +196,9 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
 
 	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
 }
+
+void __kprobes
+simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
+{
+	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+}
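
Unlike the neighbouring handlers, which only bump the instruction pointer, `simulate_nop` calls `arm64_skip_faulting_instruction`, which the commit message says was chosen for the single-stepping and FEAT_BTI scenarios. The userspace sketch below models one plausible reading of that: advance the PC and clear the PSTATE.BTYPE field. The mask value `0x00000c00` and all `*_model` names are assumptions made for illustration, and single-step handling is not modelled at all.

```c
#include <stdint.h>

#define INSN_SIZE         4u             /* every A64 instruction is 4 bytes */
#define BTYPE_MASK_MODEL  0x00000c00ull  /* assumed position of PSTATE.BTYPE */

struct regs_model {
	uint64_t pc;
	uint64_t pstate;
};

/* Model of skipping an instruction: step the PC and reset BTYPE. */
void skip_insn_model(struct regs_model *regs, unsigned long size)
{
	regs->pc += size;
	regs->pstate &= ~BTYPE_MASK_MODEL;
}

/* Model of simulate_nop(): a NOP is simulated by skipping over it. */
void simulate_nop_model(struct regs_model *regs)
{
	skip_insn_model(regs, INSN_SIZE);
}
```

Under these assumptions, simulating a NOP leaves every other PSTATE bit untouched, so the only observable effects are the 4-byte PC advance and a cleared BTYPE field.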

arch/arm64/kernel/probes/simulate-insn.h

Lines changed: 1 addition & 0 deletions
@@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
+void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
 
 #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
