| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | + |
| 3 | +===================== |
| 4 | +Introduction of mseal |
| 5 | +===================== |
| 6 | + |
| 7 | +:Author: Jeff Xu <jeffxu@chromium.org> |
| 8 | + |
| 9 | +Modern CPUs support memory permissions such as RW and NX bits. The memory |
| 10 | +permission feature improves the security stance against memory corruption |
| 11 | +bugs: an attacker can’t simply write to arbitrary memory and point the code |
| 12 | +to it, because the memory must be marked with the X bit or an exception occurs. |
| 13 | + |
| 14 | +Memory sealing additionally protects the mapping itself against |
| 15 | +modifications. This is useful to mitigate memory corruption issues where a |
| 16 | +corrupted pointer is passed to a memory management system. For example, |
| 17 | +such an attacker primitive can break control-flow integrity guarantees |
| 18 | +since read-only memory that is supposed to be trusted can become writable |
| 19 | +or .text pages can get remapped. Memory sealing can automatically be |
| 20 | +applied by the runtime loader to seal .text and .rodata pages and |
| 21 | +applications can additionally seal security critical data at runtime. |
| 22 | + |
| 23 | +A similar feature already exists in the XNU kernel with the |
| 24 | +VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2]. |
| 25 | + |
| 26 | +User API |
| 27 | +======== |
| 28 | +mseal() |
| 29 | +----------- |
| 30 | +The mseal() syscall has the following signature: |
| 31 | + |
| 32 | +``int mseal(void *addr, size_t len, unsigned long flags)`` |
| 33 | + |
| 34 | +**addr/len**: virtual memory address range. |
| 35 | + |
| 36 | +The address range set by ``addr``/``len`` must meet: |
| 37 | + - The start address must be in an allocated VMA. |
| 38 | + - The start address must be page aligned. |
| 39 | + - The end address (``addr`` + ``len``) must be in an allocated VMA. |
| 40 | + - There must be no gap (unallocated memory) between the start and end addresses. |
| 41 | + |
| 42 | +The ``len`` will be page aligned implicitly by the kernel. |
| 43 | + |
| 44 | +**flags**: reserved for future use. |
| 45 | + |
| 46 | +**return values**: |
| 47 | + |
| 48 | +- ``0``: Success. |
| 49 | + |
| 50 | +- ``-EINVAL``: |
| 51 | + - Invalid input ``flags``. |
| 52 | + - The start address (``addr``) is not page aligned. |
| 53 | + - Address range (``addr`` + ``len``) overflow. |
| 54 | + |
| 55 | +- ``-ENOMEM``: |
| 56 | + - The start address (``addr``) is not allocated. |
| 57 | + - The end address (``addr`` + ``len``) is not allocated. |
| 58 | + - A gap (unallocated memory) between start and end address. |
| 59 | + |
| 60 | +- ``-EPERM``: |
| 61 | + - Sealing is supported only on 64-bit CPUs; 32-bit CPUs are not supported. |
| 62 | + |
| 63 | +- For the above error cases, users can expect the given memory range to be |
| 64 | + unmodified, i.e. no partial update. |
| 65 | + |
| 66 | +- There might be other internal errors/cases not listed here, e.g. |
| 67 | + error during merging/splitting VMAs, or the process reaching the max |
| 68 | + number of supported VMAs. In those cases, partial updates to the given |
| 69 | + memory range could happen. However, those cases should be rare. |
| 70 | + |
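As a minimal sketch of the calling convention and error cases described above, the following C fragment invokes mseal() through syscall(2), since older C libraries may not provide a wrapper. The names ``seal_range`` and ``demo_seal_alignment`` are hypothetical, and the fallback syscall number 462 is an assumption based on the generic syscall table; the value from the installed kernel headers should be preferred when available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462   /* assumption: generic syscall number for mseal() */
#endif

/* Hypothetical wrapper: returns 0 on success, -1 with errno set on failure. */
int seal_range(void *addr, size_t len)
{
    return (int)syscall(SYS_mseal, addr, len, 0UL);   /* flags must be 0 */
}

/*
 * Hypothetical demo: map two pages, try a misaligned seal, then an aligned
 * one, then seal the same range again. Returns 0 if the kernel behaved as
 * documented or does not offer mseal() at all, and -1 otherwise.
 */
int demo_seal_alignment(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char *p = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Start address not page aligned: documented to fail with EINVAL. */
    if (seal_range(p + 1, (size_t)page) == 0)
        return -1;
    if (errno == ENOSYS || errno == EPERM) {
        /* Old kernel, 32-bit CPU, or a sandbox filtering the syscall. */
        munmap(p, 2 * (size_t)page);
        return 0;
    }
    if (errno != EINVAL)
        return -1;

    /* Aligned start inside an allocated VMA: expected to succeed. */
    if (seal_range(p, 2 * (size_t)page) != 0)
        return -1;

    /* Sealing already sealed memory is documented as a no-op. */
    if (seal_range(p, 2 * (size_t)page) != 0)
        return -1;

    return 0;   /* the sealed mapping stays in place until exit or exec */
}
```

A caller would typically treat ``ENOSYS`` as "sealing unavailable" and continue without it, since sealing is a hardening measure rather than a functional requirement.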
| 71 | +**Blocked operations after sealing**: |
| 72 | + Unmapping, moving to another location, and shrinking the size, |
| 73 | + via munmap() and mremap(). These can leave an empty space, which |
| 74 | + can then be replaced with a VMA with a new set of attributes. |
| 75 | + |
| 76 | + Moving or expanding a different VMA into the current location, |
| 77 | + via mremap(). |
| 78 | + |
| 79 | + Modifying a VMA via mmap(MAP_FIXED). |
| 80 | + |
| 81 | + Size expansion, via mremap(), does not appear to pose any |
| 82 | + specific risks to sealed VMAs. It is included anyway because |
| 83 | + the use case is unclear. In any case, users can rely on |
| 84 | + merging to expand a sealed VMA. |
| 85 | + |
| 86 | + mprotect() and pkey_mprotect(). |
| 87 | + |
| 88 | + Some destructive madvise() behaviors (e.g. MADV_DONTNEED) |
| 89 | + for anonymous memory, when users don't have write permission to the |
| 90 | + memory. Those behaviors can alter region contents by discarding pages, |
| 91 | + effectively a memset(0) for anonymous memory. |
| 92 | + |
| 93 | + The kernel returns -EPERM for blocked operations. |
| 94 | + |
| 95 | + For blocked operations, one can expect the given address range to be unmodified, |
| 96 | + i.e. no partial update. Note, this is different from existing mm |
| 97 | + system call behaviors, where partial updates are made until an error is |
| 98 | + found and returned to userspace. To give an example: |
| 99 | + |
| 100 | + Assume the following code sequence: |
| 101 | + |
| 102 | + - ptr = mmap(NULL, 8192, PROT_NONE); |
| 103 | + - munmap(ptr + 4096, 4096); |
| 104 | + - ret1 = mprotect(ptr, 8192, PROT_READ); |
| 105 | + - mseal(ptr, 4096, 0); |
| 106 | + - ret2 = mprotect(ptr, 8192, PROT_NONE); |
| 107 | + |
| 108 | + ret1 will be -ENOMEM; the page at ptr is updated to PROT_READ. |
| 109 | + |
| 110 | + ret2 will be -EPERM; the page remains PROT_READ. |
| 111 | + |
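The sequence above can be sketched as a small C program fragment. This is an illustration under stated assumptions, not code from the patch: the names ``partial_result``, ``page_is_readable`` and ``demo_partial_update`` are hypothetical, the mapping is assumed to be private and anonymous, the page size is queried at runtime instead of being fixed at 4096, and the fallback syscall number 462 is an assumption. Readability is probed by writing one byte from the page into a pipe, which fails with ``EFAULT`` instead of raising SIGSEGV when the page is not readable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462   /* assumption: generic syscall number for mseal() */
#endif

struct partial_result {
    int ret1_errno;       /* errno of the first mprotect(), 0 on success */
    int readable_after1;  /* 1 if the first page is readable afterwards */
    int sealed;           /* 1 if mseal() succeeded on this kernel */
    int ret2_errno;       /* errno of the second mprotect(), 0 on success */
    int readable_after2;  /* 1 if the first page is still readable */
};

/* Returns 1 if one byte at p can be read, 0 if not, -1 if pipe() failed. */
int page_is_readable(const void *p)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    ssize_t n = write(fds[1], p, 1);   /* EFAULT if p is not readable */
    close(fds[0]);
    close(fds[1]);
    return n == 1;
}

int demo_partial_update(struct partial_result *r)
{
    long pg = sysconf(_SC_PAGESIZE);
    assert(pg > 0);
    size_t page = (size_t)pg;

    char *ptr = mmap(NULL, 2 * page, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED)
        return -1;
    if (munmap(ptr + page, page) != 0)   /* leave a gap after page one */
        return -1;

    /* Not sealed yet: the call fails, but the first page is updated. */
    r->ret1_errno = mprotect(ptr, 2 * page, PROT_READ) == 0 ? 0 : errno;
    r->readable_after1 = page_is_readable(ptr);

    r->sealed = syscall(SYS_mseal, ptr, page, 0UL) == 0;
    if (!r->sealed) {                    /* mseal() unavailable: stop here */
        munmap(ptr, page);
        return 0;
    }

    /* Sealed: the call is rejected and the first page is left untouched. */
    r->ret2_errno = mprotect(ptr, 2 * page, PROT_NONE) == 0 ? 0 : errno;
    r->readable_after2 = page_is_readable(ptr);
    return 0;
}
```

On a kernel with mseal() support, the fields of ``partial_result`` correspond to ret1 and ret2 in the text; on kernels without it, only the first half of the sequence is exercised.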
| 112 | +**Note**: |
| 113 | + |
| 114 | +- mseal() only works on 64-bit CPUs, not on 32-bit CPUs. |
| 115 | + |
| 116 | +- Users can call mseal() multiple times; calling mseal() on already sealed |
| 117 | + memory is a no-op (not an error). |
| 118 | + |
| 119 | +- munseal() is not supported. |
| 120 | + |
| 121 | +Use cases: |
| 122 | +========== |
| 123 | +- glibc: |
| 124 | + The dynamic linker, while loading ELF executables, can apply sealing to |
| 125 | + non-writable memory segments. |
| 126 | + |
| 127 | +- Chrome browser: protect some security sensitive data-structures. |
| 128 | + |
| 129 | +Notes on which memory to seal: |
| 130 | +============================== |
| 131 | + |
| 132 | +It is important to note that sealing changes the lifetime of a mapping, |
| 133 | +i.e. the sealed mapping won’t be unmapped until the process terminates or the |
| 134 | +exec system call is invoked. Applications can apply sealing to any virtual |
| 135 | +memory region from userspace, but it is crucial to thoroughly analyze the |
| 136 | +mapping's lifetime prior to applying the sealing. |
| 137 | + |
| 138 | +For example: |
| 139 | + |
| 140 | +- aio/shm |
| 141 | + |
| 142 | + aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in |
| 143 | + shm.c. The lifetime of those mappings is not tied to the lifetime of the |
| 144 | + process. If that memory is sealed from userspace, then munmap() will fail, |
| 145 | + causing leaks in VMA address space during the lifetime of the process. |
| 146 | + |
| 147 | +- Brk (heap) |
| 148 | + |
| 149 | + Currently, userspace applications can seal parts of the heap by calling |
| 150 | + malloc() and mseal(). |
| 151 | + Let's assume the following calls from user space: |
| 152 | + |
| 153 | + - ptr = malloc(size); |
| 154 | + - mprotect(ptr, size, RO); |
| 155 | + - mseal(ptr, size, 0); |
| 156 | + - free(ptr); |
| 157 | + |
| 158 | + Technically, before mseal() was added, the user could change the protection of |
| 159 | + the heap by calling mprotect(RO). As long as the user changed the protection |
| 160 | + back to RW before free(), the memory range could be reused. |
| 161 | + |
| 162 | + With mseal() in the picture, however, the heap is then partially sealed: |
| 163 | + the user can still free it, but the memory remains RO. If the address |
| 164 | + is re-used by the heap manager for another malloc, the process might crash |
| 165 | + soon after. Therefore, it is important not to apply sealing to any memory |
| 166 | + that might get recycled. |
| 167 | + |
| 168 | + Furthermore, even if the application never calls free() for the ptr, |
| 169 | + the heap manager may invoke the brk system call to shrink the size of the |
| 170 | + heap. In the kernel, the brk-shrink will call munmap(). Consequently, |
| 171 | + depending on the location of the ptr, the outcome of brk-shrink is |
| 172 | + nondeterministic. |
| 173 | + |
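One way to follow the advice above is to keep data that will be sealed out of the malloc() heap entirely. The sketch below is an assumption about how an application might do this, not something prescribed by the patch: the helper name ``make_sealed_copy`` is hypothetical, and the fallback syscall number 462 is an assumption. It copies the data into a dedicated anonymous mapping that the heap manager never recycles, makes it read-only, and then attempts to seal it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_mseal
#define SYS_mseal 462   /* assumption: generic syscall number for mseal() */
#endif

/*
 * Hypothetical helper: copy len bytes from src into a fresh anonymous
 * mapping, make it read-only and try to seal it. Returns the mapping or
 * NULL on failure; *sealed is set to 1 if mseal() succeeded, 0 otherwise.
 * The mapping is meant to live until the process exits or calls exec.
 */
void *make_sealed_copy(const void *src, size_t len, int *sealed)
{
    long pg = sysconf(_SC_PAGESIZE);
    assert(pg > 0 && len > 0);
    size_t page = (size_t)pg;
    size_t map_len = (len + page - 1) / page * page;   /* round up to pages */

    void *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memcpy(p, src, len);

    if (mprotect(p, map_len, PROT_READ) != 0) {
        munmap(p, map_len);
        return NULL;
    }

    /* Sealing is best effort here: it fails on kernels without mseal(). */
    *sealed = syscall(SYS_mseal, p, map_len, 0UL) == 0;
    return p;
}
```

Because the region is never handed back to an allocator, neither free() nor a brk-shrink can run into the sealed pages.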
| 174 | + |
| 175 | +Additional notes: |
| 176 | +================= |
| 177 | +As Jann Horn pointed out in [3], there are still a few ways to write |
| 178 | +to RO memory, which is, in a way, by design. Those cases are not covered |
| 179 | +by mseal(). If applications want to block such cases, sandbox tools (such as |
| 180 | +seccomp, LSM, etc.) might be considered. |
| 181 | + |
| 182 | +Those cases are: |
| 183 | + |
| 184 | +- Write to read-only memory through /proc/self/mem interface. |
| 185 | +- Write to read-only memory through ptrace (such as PTRACE_POKETEXT). |
| 186 | +- userfaultfd. |
| 187 | + |
| 188 | +The idea that inspired this patch comes from Stephen Röttger’s work in V8 |
| 189 | +CFI [4]. Chrome browser in ChromeOS will be the first user of this API. |
| 190 | + |
| 191 | +Reference: |
| 192 | +========== |
| 193 | +[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 |
| 194 | + |
| 195 | +[2] https://man.openbsd.org/mimmutable.2 |
| 196 | + |
| 197 | +[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com |
| 198 | + |
| 199 | +[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc |