Commit cde5c32

mrutland-arm authored and willdeacon committed
arm64/fpsimd: Make clone() compatible with ZA lazy saving

Linux is intended to be compatible with userspace written to Arm's
AAPCS64 procedure call standard [1,2]. For the Scalable Matrix Extension
(SME), AAPCS64 was extended with a "ZA lazy saving scheme", where SME's
ZA tile is lazily callee-saved and caller-restored. In this scheme,
TPIDR2_EL0 indicates whether the ZA tile is live or has been saved by
pointing to a "TPIDR2 block" in memory, which has a "za_save_buffer"
pointer. This scheme has been implemented in GCC and LLVM, with
necessary runtime support implemented in glibc and bionic.

AAPCS64 does not specify how the ZA lazy saving scheme is expected to
interact with thread creation mechanisms such as fork() and
pthread_create(), which would be implemented in terms of the Linux clone
syscall. The behaviour implemented by Linux and glibc/bionic doesn't
always compose safely, as explained below.

Currently the clone syscall is implemented such that PSTATE.ZA and the
ZA tile are always inherited by the new task, and TPIDR2_EL0 is
inherited unless the 'flags' argument includes CLONE_SETTLS, in which
case TPIDR2_EL0 is set to 0/NULL. This doesn't make much sense:

(a) TPIDR2_EL0 is part of the calling convention, and changes as control
    is passed between functions. It is *NOT* used for thread local
    storage, despite superficial similarity to TPIDR_EL0, which is used
    as the TLS register.

(b) TPIDR2_EL0 and PSTATE.ZA are tightly coupled in the procedure call
    standard, and some combinations of states are illegal. In general,
    manipulating the two independently is not guaranteed to be safe.

In practice, code which is compliant with the procedure call standard
may issue a clone syscall while in the "ZA dormant" state, where
PSTATE.ZA==1 and TPIDR2_EL0 is non-null and indicates that ZA needs to
be saved. This can cause a variety of problems, including:

* If the implementation of pthread_create() passes CLONE_SETTLS, the
  new thread will start with PSTATE.ZA==1 and TPIDR2==NULL. Per the
  procedure call standard this is not a legitimate state for most
  functions. This can cause data corruption (e.g. as code may rely on
  PSTATE.ZA being 0 to guarantee that an SMSTART ZA instruction will
  zero the ZA tile contents), and may result in other undefined
  behaviour.

* If the implementation of pthread_create() does not pass CLONE_SETTLS,
  the new thread will start with PSTATE.ZA==1 and TPIDR2 pointing to a
  TPIDR2 block on the parent thread's stack. This can result in a
  variety of problems, e.g.

  - The child may write back to the parent's za_save_buffer, corrupting
    its contents.

  - The child may read from the TPIDR2 block after the parent has reused
    this memory for something else, and consequently the child may abort
    or clobber arbitrary memory.

Ideally we'd require that userspace ensures that a task is in the "ZA
off" state (with PSTATE.ZA==0 and TPIDR2_EL0==NULL) prior to issuing a
clone syscall, and have the kernel force this state for new threads.
Unfortunately, contemporary C libraries do not do this, and simply
forcing this state within the implementation of clone would break
fork().

Instead, we can bodge around this by considering the CLONE_VM flag, and
manipulate PSTATE.ZA and TPIDR2_EL0 as a pair. CLONE_VM indicates that
the new task will run in the same address space as its parent, and in
that case it doesn't make sense to inherit a stale pointer to the
parent's TPIDR2 block:

* For fork(), CLONE_VM will not be set, and it is safe to inherit both
  PSTATE.ZA and TPIDR2_EL0 as the new task will have its own copy of the
  address space, and cannot clobber its parent's stack.

* For pthread_create() and vfork(), CLONE_VM will be set, and discarding
  PSTATE.ZA and TPIDR2_EL0 for the new task doesn't break any existing
  assumptions in userspace.

Implement this behaviour for clone(). We currently inherit PSTATE.ZA in
arch_dup_task_struct(), but this does not have access to the clone
flags, so move this logic under copy_thread(). Documentation is updated
to describe the new behaviour.

[1] https://github.com/ARM-software/abi-aa/releases/download/2025Q1/aapcs64.pdf
[2] https://github.com/ARM-software/abi-aa/blob/c51addc3dc03e73a016a1e4edf25440bcac76431/aapcs64/aapcs64.rst

Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Richard Sandiford <richard.sandiford@arm.com>
Cc: Sander De Smalen <sander.desmalen@arm.com>
Cc: Tamas Petz <tamas.petz@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yury Khrustalev <yury.khrustalev@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Link: https://lore.kernel.org/r/20250508132644.1395904-14-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>

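
The difference between the old and new clone behaviour can be sketched as a small standalone C model. This is only an illustration of the policy described in the commit message, not kernel code: `struct za_model`, `clone_old()`, `clone_new()` and the `MODEL_*` constants are hypothetical names, and the two flag values are copied from the Linux uapi definitions of CLONE_VM and CLONE_SETTLS so that the model builds without kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: values mirror Linux's CLONE_VM and CLONE_SETTLS. */
#define MODEL_CLONE_VM     0x00000100UL
#define MODEL_CLONE_SETTLS 0x00080000UL

/* Hypothetical model of the two pieces of state the commit pairs up. */
struct za_model {
	bool      pstate_za; /* models PSTATE.ZA */
	uintptr_t tpidr2;    /* models TPIDR2_EL0, 0 meaning NULL */
};

/* "ZA off": PSTATE.ZA==0 and TPIDR2_EL0==NULL. */
static bool is_za_off(struct za_model s)
{
	return !s.pstate_za && s.tpidr2 == 0;
}

/* "ZA dormant": PSTATE.ZA==1 and TPIDR2_EL0 is non-null. */
static bool is_za_dormant(struct za_model s)
{
	return s.pstate_za && s.tpidr2 != 0;
}

/*
 * Old behaviour: PSTATE.ZA is always inherited, and TPIDR2_EL0 is
 * inherited unless CLONE_SETTLS is passed.
 */
static struct za_model clone_old(struct za_model parent, unsigned long flags)
{
	struct za_model child = parent;

	if (flags & MODEL_CLONE_SETTLS)
		child.tpidr2 = 0;

	return child;
}

/*
 * New behaviour: both are inherited when the child gets its own address
 * space (no CLONE_VM), and both are reset otherwise.
 */
static struct za_model clone_new(struct za_model parent, unsigned long flags)
{
	struct za_model child = { .pstate_za = false, .tpidr2 = 0 };

	if (!(flags & MODEL_CLONE_VM))
		child = parent;

	return child;
}
```

With a parent in the "ZA dormant" state, `clone_old()` with `MODEL_CLONE_VM | MODEL_CLONE_SETTLS` yields PSTATE.ZA==1 with a null TPIDR2, whereas `clone_new()` with the same flags yields "ZA off". With no CLONE_VM (the fork() case) the new policy leaves the child dormant with the parent's TPIDR2 value.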
1 parent a6d066f commit cde5c32

2 files changed, +64 -32 lines changed

Documentation/arch/arm64/sme.rst

Lines changed: 2 additions & 2 deletions
@@ -69,8 +69,8 @@ model features for SME is included in Appendix A.
   vectors from 0 to VL/8-1 stored in the same endianness invariant format as is
   used for SVE vectors.
 
-* On thread creation TPIDR2_EL0 is preserved unless CLONE_SETTLS is specified,
-  in which case it is set to 0.
+* On thread creation PSTATE.ZA and TPIDR2_EL0 are preserved unless CLONE_VM
+  is specified, in which case PSTATE.ZA is set to 0 and TPIDR2_EL0 is set to 0.
 
 2. Vector lengths
 ------------------

arch/arm64/kernel/process.c

Lines changed: 62 additions & 30 deletions
@@ -364,38 +364,46 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	task_smstop_sm(dst);
 
 	/*
-	 * In the unlikely event that we create a new thread with ZA
-	 * enabled we should retain the ZA and ZT state so duplicate
-	 * it here. This may be shortly freed if we exec() or if
-	 * CLONE_SETTLS but it's simpler to do it here. To avoid
-	 * confusing the rest of the code ensure that we have a
-	 * sve_state allocated whenever sme_state is allocated.
+	 * Drop stale reference to src's sme_state and ensure dst has ZA
+	 * disabled.
+	 *
+	 * When necessary, ZA will be inherited later in copy_thread_za().
 	 */
-	if (thread_za_enabled(&src->thread)) {
-		dst->thread.sve_state = kzalloc(sve_state_size(src),
-						GFP_KERNEL);
-		if (!dst->thread.sve_state)
-			return -ENOMEM;
-
-		dst->thread.sme_state = kmemdup(src->thread.sme_state,
-						sme_state_size(src),
-						GFP_KERNEL);
-		if (!dst->thread.sme_state) {
-			kfree(dst->thread.sve_state);
-			dst->thread.sve_state = NULL;
-			return -ENOMEM;
-		}
-	} else {
-		dst->thread.sme_state = NULL;
-		clear_tsk_thread_flag(dst, TIF_SME);
-	}
+	dst->thread.sme_state = NULL;
+	clear_tsk_thread_flag(dst, TIF_SME);
+	dst->thread.svcr &= ~SVCR_ZA_MASK;
 
 	/* clear any pending asynchronous tag fault raised by the parent */
 	clear_tsk_thread_flag(dst, TIF_MTE_ASYNC_FAULT);
 
 	return 0;
 }
 
+static int copy_thread_za(struct task_struct *dst, struct task_struct *src)
+{
+	if (!thread_za_enabled(&src->thread))
+		return 0;
+
+	dst->thread.sve_state = kzalloc(sve_state_size(src),
+					GFP_KERNEL);
+	if (!dst->thread.sve_state)
+		return -ENOMEM;
+
+	dst->thread.sme_state = kmemdup(src->thread.sme_state,
+					sme_state_size(src),
+					GFP_KERNEL);
+	if (!dst->thread.sme_state) {
+		kfree(dst->thread.sve_state);
+		dst->thread.sve_state = NULL;
+		return -ENOMEM;
+	}
+
+	set_tsk_thread_flag(dst, TIF_SME);
+	dst->thread.svcr |= SVCR_ZA_MASK;
+
+	return 0;
+}
+
 asmlinkage void ret_from_fork(void) asm("ret_from_fork");
 
 int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
@@ -428,8 +436,6 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 		 * out-of-sync with the saved value.
 		 */
 		*task_user_tls(p) = read_sysreg(tpidr_el0);
-		if (system_supports_tpidr2())
-			p->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
 
 		if (system_supports_poe())
 			p->thread.por_el0 = read_sysreg_s(SYS_POR_EL0);
@@ -441,14 +447,40 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 				childregs->sp = stack_start;
 		}
 
+		/*
+		 * Due to the AAPCS64 "ZA lazy saving scheme", PSTATE.ZA and
+		 * TPIDR2 need to be manipulated as a pair, and either both
+		 * need to be inherited or both need to be reset.
+		 *
+		 * Within a process, child threads must not inherit their
+		 * parent's TPIDR2 value or they may clobber their parent's
+		 * stack at some later point.
+		 *
+		 * When a process is fork()'d, the child must inherit ZA and
+		 * TPIDR2 from its parent in case there was dormant ZA state.
+		 *
+		 * Use CLONE_VM to determine when the child will share the
+		 * address space with the parent, and cannot safely inherit the
+		 * state.
+		 */
+		if (system_supports_sme()) {
+			if (!(clone_flags & CLONE_VM)) {
+				p->thread.tpidr2_el0 = read_sysreg_s(SYS_TPIDR2_EL0);
+				ret = copy_thread_za(p, current);
+				if (ret)
+					return ret;
+			} else {
+				p->thread.tpidr2_el0 = 0;
+				WARN_ON_ONCE(p->thread.svcr & SVCR_ZA_MASK);
+			}
+		}
+
 		/*
 		 * If a TLS pointer was passed to clone, use it for the new
-		 * thread. We also reset TPIDR2 if it's in use.
+		 * thread.
 		 */
-		if (clone_flags & CLONE_SETTLS) {
+		if (clone_flags & CLONE_SETTLS)
 			p->thread.uw.tp_value = tls;
-			p->thread.tpidr2_el0 = 0;
-		}
 
 		ret = copy_thread_gcs(p, args);
 		if (ret != 0)
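
copy_thread_za() in the diff above allocates two buffers and, if the second allocation fails, frees the first so that the destination never ends up holding only one of them. A minimal userspace sketch of that all-or-nothing pattern follows. It assumes plain calloc()/malloc() in place of kzalloc()/kmemdup(), and `struct buf_pair` and `dup_pair()` are hypothetical names that do not exist in the kernel.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the sve_state/sme_state pointer pair. */
struct buf_pair {
	void *a; /* zero-initialised buffer, like the kzalloc() result */
	void *b; /* copy of the source buffer, like the kmemdup() result */
};

/*
 * Allocate both buffers or neither. On failure dst->a is left NULL, so
 * the caller never sees a half-initialised pair.
 */
static int dup_pair(struct buf_pair *dst, size_t a_size,
		    const void *src_b, size_t b_size)
{
	dst->a = calloc(1, a_size);
	if (!dst->a)
		return -ENOMEM;

	dst->b = malloc(b_size);
	if (!dst->b) {
		/* Undo the first allocation before reporting failure. */
		free(dst->a);
		dst->a = NULL;
		return -ENOMEM;
	}
	memcpy(dst->b, src_b, b_size);

	return 0;
}
```

A caller would pass the two sizes and the source buffer, check for a non-zero return, and later free() both members once the pair is no longer needed.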
