|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +====================================================== |
| 4 | +Control-flow Enforcement Technology (CET) Shadow Stack |
| 5 | +====================================================== |
| 6 | + |
| 7 | +CET Background |
| 8 | +============== |
| 9 | + |
| 10 | +Control-flow Enforcement Technology (CET) covers several related x86 processor |
| 11 | +features that provide protection against control flow hijacking attacks. CET |
| 12 | +can protect both applications and the kernel. |
| 13 | + |
| 14 | +CET introduces shadow stack and indirect branch tracking (IBT). A shadow stack |
| 15 | +is a secondary stack allocated from memory which cannot be directly modified by |
| 16 | +applications. When executing a CALL instruction, the processor pushes the |
| 17 | +return address to both the normal stack and the shadow stack. Upon |
| 18 | +function return, the processor pops the shadow stack copy and compares it |
| 19 | +to the normal stack copy. If the two differ, the processor raises a |
| 20 | +control-protection fault. IBT verifies that indirect CALL/JMP targets are intended, |
| 21 | +as marked by the compiler with 'ENDBR' opcodes. Not all CPUs have both Shadow |
| 22 | +Stack and Indirect Branch Tracking. Today in the 64-bit kernel, only userspace |
| 23 | +shadow stack and kernel IBT are supported. |
| 24 | + |
| 25 | +Requirements to use Shadow Stack |
| 26 | +================================ |
| 27 | + |
| 28 | +To use userspace shadow stack you need HW that supports it, a kernel |
| 29 | +configured with it and userspace libraries compiled with it. |
| 30 | + |
| 31 | +The kernel Kconfig option is X86_USER_SHADOW_STACK. When compiled in, shadow |
| 32 | +stacks can be disabled at runtime with the kernel parameter: nousershstk. |
| 33 | + |
| 34 | +To build a user shadow stack enabled kernel, Binutils v2.29 or later, or LLVM v6 |
| 35 | +or later, is required. |
| 36 | + |
| 37 | +At run time, /proc/cpuinfo shows CET features if the processor supports |
| 38 | +CET. "user_shstk" means that userspace shadow stack is supported on the current |
| 39 | +kernel and HW. |
| 40 | + |
| 41 | +Application Enabling |
| 42 | +==================== |
| 43 | + |
| 44 | +An application's CET capability is marked in its ELF note and can be verified |
| 45 | +from readelf/llvm-readelf output:: |
| 46 | + |
| 47 | + readelf -n <application> | grep -a SHSTK |
| 48 | + properties: x86 feature: SHSTK |
| 49 | + |
| 50 | +The kernel does not process these application markers directly. Applications |
| 51 | +or loaders must enable CET features using the interface described in section 4. |
| 52 | +Typically this would be done in the dynamic loader or static runtime objects, as is |
| 53 | +the case in GLIBC. |
| 54 | + |
| 55 | +Enabling arch_prctl()'s |
| 56 | +======================= |
| 57 | + |
| 58 | +ELF features should be enabled by the loader using the arch_prctl() calls below. They |
| 59 | +are only supported in 64-bit user applications. These operate on the features |
| 60 | +on a per-thread basis. The enablement status is inherited on clone, so if the |
| 61 | +feature is enabled on the first thread, it will propagate to all the threads |
| 62 | +in an app. |
| 63 | + |
| 64 | +arch_prctl(ARCH_SHSTK_ENABLE, unsigned long feature) |
| 65 | + Enable a single feature specified in 'feature'. Can only operate on |
| 66 | + one feature at a time. |
| 67 | + |
| 68 | +arch_prctl(ARCH_SHSTK_DISABLE, unsigned long feature) |
| 69 | + Disable a single feature specified in 'feature'. Can only operate on |
| 70 | + one feature at a time. |
| 71 | + |
| 72 | +arch_prctl(ARCH_SHSTK_LOCK, unsigned long features) |
| 73 | + Lock in features at their current enabled or disabled status. 'features' |
| 74 | + is a mask of all features to lock. All bits set are processed, unset bits |
| 75 | + are ignored. The mask is ORed with the existing value. So any feature bits |
| 76 | + set here cannot be enabled or disabled afterwards. |
| 77 | + |
| 78 | +arch_prctl(ARCH_SHSTK_UNLOCK, unsigned long features) |
| 79 | + Unlock features. 'features' is a mask of all features to unlock. All |
| 80 | + bits set are processed, unset bits are ignored. Only works via ptrace. |
| 81 | + |
| 82 | +arch_prctl(ARCH_SHSTK_STATUS, unsigned long addr) |
| 83 | + Copy the currently enabled features to the address passed in addr. The |
| 84 | + features are described using the same bits that the other operations |
| 85 | + take in 'features'. |
| 86 | + |
| 87 | +The return values are as follows. On success, return 0. On error, errno can |
| 88 | +be:: |
| 89 | + |
| 90 | + -EPERM if any of the passed features are locked. |
| 91 | + -ENOTSUPP if the feature is not supported by the hardware or |
| 92 | + kernel. |
| 93 | + -EINVAL if the arguments are invalid (nonexistent feature, etc.) |
| 94 | + -EFAULT if the information could not be copied back to userspace |
| 95 | + |
| 96 | +The supported feature bits are:: |
| 97 | + |
| 98 | + ARCH_SHSTK_SHSTK - Shadow stack |
| 99 | + ARCH_SHSTK_WRSS - WRSS |
| 100 | + |
| 101 | +Currently shadow stack and WRSS are supported via this interface. WRSS |
| 102 | +can only be enabled with shadow stack, and is automatically disabled |
| 103 | +if shadow stack is disabled. |
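As a minimal sketch of the status query described above, the following C helper calls arch_prctl(ARCH_SHSTK_STATUS) through syscall(2). The helper name shstk_query_status() is hypothetical. The numeric constants are believed to match the kernel's arch/x86/include/uapi/asm/prctl.h, but they are repeated here only as a fallback and should be checked against the installed headers. The sketch deliberately only queries; enabling shadow stack from ordinary C code needs more care and is normally left to the loader.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Fallback definitions (assumption: values match the kernel uapi header). */
#ifndef ARCH_SHSTK_STATUS
#define ARCH_SHSTK_STATUS 0x5005
#endif
#ifndef ARCH_SHSTK_SHSTK
#define ARCH_SHSTK_SHSTK (1ULL << 0)
#endif
#ifndef ARCH_SHSTK_WRSS
#define ARCH_SHSTK_WRSS (1ULL << 1)
#endif

/*
 * Hypothetical helper: store the calling thread's enabled feature bits in
 * *features. Returns 0 on success, -1 with errno set on failure (including
 * kernels or architectures without this interface).
 */
static int shstk_query_status(uint64_t *features)
{
    assert(features != NULL);
#if defined(__linux__) && defined(__x86_64__) && defined(SYS_arch_prctl)
    unsigned long value = 0;

    /* The kernel copies the enabled-feature mask to the address given. */
    if (syscall(SYS_arch_prctl, ARCH_SHSTK_STATUS, &value) != 0)
        return -1;
    *features = value;
    return 0;
#else
    errno = ENOSYS; /* not a 64-bit x86 Linux build */
    return -1;
#endif
}
```

A caller would test the result against ARCH_SHSTK_SHSTK and ARCH_SHSTK_WRSS, and treat a failure as "shadow stack not available" rather than as a fatal error.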
| 104 | + |
| 105 | +Proc Status |
| 106 | +=========== |
| 107 | +To check if an application is actually running with shadow stack, the |
| 108 | +user can read /proc/$PID/status. It will report "wrss" or "shstk" |
| 109 | +depending on what is enabled. The lines look like this:: |
| 110 | + |
| 111 | + x86_Thread_features: shstk wrss |
| 112 | + x86_Thread_features_locked: shstk wrss |
| 113 | + |
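The status lines shown above can be checked programmatically. The sketch below parses one line of /proc/$PID/status text into a small bitmask. The function name and the FEAT_* flags are hypothetical names for this example, and the exact whitespace after the colon is an assumption, so the parser accepts spaces, tabs and newlines.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag names used only by this sketch. */
#define FEAT_SHSTK (1u << 0)
#define FEAT_WRSS  (1u << 1)

/*
 * Return a mask of the features named on an "x86_Thread_features:" line,
 * or 0 if the line is a different key (for example the "_locked" variant).
 */
static unsigned parse_thread_features(const char *line)
{
    static const char key[] = "x86_Thread_features:";
    char buf[128];
    unsigned mask = 0;

    assert(line != NULL);
    if (strncmp(line, key, sizeof(key) - 1) != 0)
        return 0;

    /* Work on a bounded private copy because strtok() modifies its input. */
    strncpy(buf, line + sizeof(key) - 1, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
        if (strcmp(tok, "shstk") == 0)
            mask |= FEAT_SHSTK;
        else if (strcmp(tok, "wrss") == 0)
            mask |= FEAT_WRSS;
    }
    return mask;
}
```

In practice each line read from the status file with fgets() would be passed to this function until it returns a non-zero mask or the file ends.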
| 114 | +Implementation of the Shadow Stack |
| 115 | +================================== |
| 116 | + |
| 117 | +Shadow Stack Size |
| 118 | +----------------- |
| 119 | + |
| 120 | +A task's shadow stack is allocated from memory to a fixed size of |
| 121 | +MIN(RLIMIT_STACK, 4 GB). In other words, the shadow stack is allocated to |
| 122 | +the maximum size of the normal stack, but capped to 4 GB. In the case |
| 123 | +of the clone3 syscall, there is a stack size passed in and shadow stack |
| 124 | +uses this instead of the rlimit. |
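The sizing rule above can be written as a small function. This is only an illustration of MIN(RLIMIT_STACK, 4 GB) and the clone3 override as described in the text; the function name is hypothetical, a clone3 size of 0 is assumed to mean "not provided", and any rounding the kernel may apply is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4G (4ULL << 30)

static_assert(SZ_4G == 4294967296ULL, "4 GB cap expressed in bytes");

/*
 * Hypothetical helper: pick the shadow stack size for a new task.
 * clone3_stack_size == 0 is this sketch's convention for "no size passed".
 */
static uint64_t shstk_size_sketch(uint64_t rlimit_stack, uint64_t clone3_stack_size)
{
    /* clone3 passes a stack size, which is used instead of the rlimit. */
    if (clone3_stack_size != 0)
        return clone3_stack_size;

    /* Otherwise: the normal stack's maximum size, capped to 4 GB. */
    return rlimit_stack < SZ_4G ? rlimit_stack : SZ_4G;
}
```

With the common 8 MB stack rlimit this yields an 8 MB shadow stack, while an unlimited rlimit yields the 4 GB cap.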
| 125 | + |
| 126 | +Signal |
| 127 | +------ |
| 128 | + |
| 129 | +The main program and its signal handlers use the same shadow stack. Because |
| 130 | +the shadow stack stores only return addresses, a large shadow stack covers |
| 131 | +the condition that both the program stack and the signal alternate stack run |
| 132 | +out. |
| 133 | + |
| 134 | +When a signal happens, the old pre-signal state is pushed on the stack. When |
| 135 | +shadow stack is enabled, the shadow stack specific state is pushed onto the |
| 136 | +shadow stack. Today this is only the old SSP (shadow stack pointer), pushed |
| 137 | +in a special format with bit 63 set. On sigreturn this old SSP token is |
| 138 | +verified and restored by the kernel. The kernel will also push the normal |
| 139 | +restorer address to the shadow stack to help userspace avoid a shadow stack |
| 140 | +violation on the sigreturn path that goes through the restorer. |
| 141 | + |
| 142 | +So the shadow stack signal frame format is as follows:: |
| 143 | + |
| 144 | + |1...old SSP| - Pointer to old pre-signal ssp in sigframe token format |
| 145 | + (bit 63 set to 1) |
| 146 | + | ...| - Other state may be added in the future |
| 147 | + |
| 148 | + |
| 149 | +32-bit ABI signals are not supported in shadow stack processes. Linux prevents |
| 150 | +32-bit execution while shadow stack is enabled by allocating shadow stacks |
| 151 | +outside of the 32-bit address space. When execution enters 32-bit mode, either |
| 152 | +via far call or returning to userspace, a #GP is generated by the hardware, |
| 153 | +which will be delivered to the process as a segfault. When transitioning to |
| 154 | +userspace the registers' state will be as if the userspace ip being returned to |
| 155 | +caused the segfault. |
| 156 | + |
| 157 | +Fork |
| 158 | +---- |
| 159 | + |
| 160 | +The shadow stack's vma has VM_SHADOW_STACK flag set; its PTEs are required |
| 161 | +to be read-only and dirty. When a shadow stack PTE is not RO and dirty, a |
| 162 | +shadow stack access triggers a page fault with the shadow stack access bit set |
| 163 | +in the page fault error code. |
| 164 | + |
| 165 | +When a task forks a child, its shadow stack PTEs are copied and both the |
| 166 | +parent's and the child's shadow stack PTEs are cleared of the dirty bit. |
| 167 | +Upon the next shadow stack access, the resulting shadow stack page fault |
| 168 | +is handled by page copy/re-use. |
| 169 | + |
| 170 | +When a pthread child is created, the kernel allocates a new shadow stack |
| 171 | +for the new thread. New shadow stack creation behaves like mmap() with respect |
| 172 | +to ASLR behavior. Similarly, on thread exit the thread's shadow stack is |
| 173 | +disabled. |
| 174 | + |
| 175 | +Exec |
| 176 | +---- |
| 177 | + |
| 178 | +On exec, shadow stack features are disabled by the kernel, at which point |
| 179 | +userspace can choose to re-enable or lock them. |