| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | +.. _atomic_writes: |
| 3 | + |
| 4 | +Atomic Block Writes |
| 5 | +------------------------- |
| 6 | + |
| 7 | +Introduction |
| 8 | +~~~~~~~~~~~~ |
| 9 | + |
| 10 | +Atomic (untorn) block writes ensure that either the entire write is committed |
| 11 | +to disk or none of it is. This prevents "torn writes" during power loss or |
| 12 | +system crashes. The ext4 filesystem supports atomic writes (only with Direct |
| 13 | +I/O) on regular files with extents, provided the underlying storage device |
| 14 | +supports hardware atomic writes. This is supported in the following two ways: |
| 15 | + |
| 16 | +1. **Single-fsblock Atomic Writes**: |
| 17 | + EXT4 has supported atomic write operations of a single filesystem block since |
| 18 | + v6.13. In this mode, the atomic write unit minimum and maximum sizes are both |
| 19 | + set to the filesystem block size. |
| 20 | + For example, an atomic write of 16KB with a 16KB filesystem block size on a |
| 21 | + system with a 64KB page size is possible. |
| 22 | + |
| 23 | +2. **Multi-fsblock Atomic Writes with Bigalloc**: |
| 24 | + EXT4 now also supports atomic writes spanning multiple filesystem blocks |
| 25 | + using a feature known as bigalloc. The atomic write unit's minimum and |
| 26 | + maximum sizes are determined by the filesystem block size and cluster size, |
| 27 | + based on the underlying device’s supported atomic write unit limits. |
| 28 | + |
| 29 | +Requirements |
| 30 | +~~~~~~~~~~~~ |
| 31 | + |
| 32 | +Basic requirements for atomic writes in ext4: |
| 33 | + |
| 34 | + 1. The extents feature must be enabled (default for ext4) |
| 35 | + 2. The underlying block device must support atomic writes |
| 36 | + 3. For single-fsblock atomic writes: |
| 37 | + |
| 38 | + 1. A filesystem with an appropriate block size (up to the page size) |
| 39 | + 4. For multi-fsblock atomic writes: |
| 40 | + |
| 41 | + 1. The bigalloc feature must be enabled |
| 42 | + 2. The cluster size must be appropriately configured |
| 43 | + |
| 44 | +NOTE: EXT4 does not support software or COW based atomic writes, which means |
| 45 | +atomic writes on ext4 are only supported if the underlying storage device |
| 46 | +supports them. |
| 47 | + |
| 48 | +Multi-fsblock Implementation Details |
| 49 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 50 | + |
| 51 | +The bigalloc feature changes ext4 to allocate in units of multiple filesystem |
| 52 | +blocks, also known as clusters. With bigalloc, each bit within the block bitmap |
| 53 | +represents a cluster (a power-of-2 number of blocks) rather than an individual |
| 54 | +filesystem block. |
| 55 | +EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the |
| 56 | +following constraints. The minimum atomic write size is the larger of the fs |
| 57 | +block size and the minimum hardware atomic write unit; and the maximum atomic |
| 58 | +write size is the smaller of the bigalloc cluster size and the maximum hardware |
| 59 | +atomic write unit. Bigalloc ensures that all allocations are aligned to the |
| 60 | +cluster size, which satisfies the LBA alignment requirements of the hardware |
| 61 | +device if the start of the partition/logical volume is itself aligned correctly. |
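The min/max rule above can be sketched as a small standalone C helper. This is only an illustration under stated assumptions, not the ext4 implementation: ``struct awu_limits`` and ``awu_compute()`` are hypothetical names, all sizes are in bytes, and a result of ``{0, 0}`` is used here to mean that atomic writes are unavailable.

```c
#include <assert.h>

/* Hypothetical container for the resulting limits; not a kernel structure. */
struct awu_limits {
	unsigned int min;	/* 0 means atomic writes are unavailable */
	unsigned int max;
};

/*
 * Sketch of the rule described above (all sizes in bytes):
 *   min = larger of (fs block size, hardware minimum unit)
 *   max = smaller of (cluster size, hardware maximum unit)
 * A hardware limit of 0, or an empty range (min > max), yields {0, 0}.
 */
static struct awu_limits awu_compute(unsigned int blocksize,
				     unsigned int clustersize,
				     unsigned int hw_min, unsigned int hw_max)
{
	struct awu_limits l = { 0, 0 };
	unsigned int lo, hi;

	/* Assumption: a cluster is never smaller than one block. */
	assert(blocksize != 0 && blocksize <= clustersize);

	if (hw_min == 0 || hw_max == 0)
		return l;

	lo = blocksize > hw_min ? blocksize : hw_min;
	hi = clustersize < hw_max ? clustersize : hw_max;
	if (lo > hi)
		return l;

	l.min = lo;
	l.max = hi;
	return l;
}
```

For instance, with a 4KB block size, a 64KB cluster size and a device advertising units from 512 bytes to 128KB, this sketch yields a minimum of 4KB and a maximum of 64KB.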
| 62 | + |
| 63 | +Here is the block allocation strategy in bigalloc for atomic writes: |
| 64 | + |
| 65 | + * For regions with fully mapped extents, no additional work is needed |
| 66 | + * For append writes, a new mapped extent is allocated |
| 67 | + * For regions that are entirely holes, an unwritten extent is created |
| 68 | + * For large unwritten extents, the extent gets split into two unwritten |
| 69 | + extents of appropriate requested size |
| 70 | + * For mixed mapping regions (combinations of holes, unwritten extents, or |
| 71 | + mapped extents), ext4_map_blocks() is called in a loop with |
| 72 | + EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous |
| 73 | + mapped extent by writing zeroes to it and converting any unwritten extents to |
| 74 | + written, if found within the range. |
| 75 | + |
| 76 | +Note: Writing on a single contiguous underlying extent, whether mapped or |
| 77 | +unwritten, is not inherently problematic. However, writing to a mixed mapping |
| 78 | +region (i.e. one containing a combination of mapped and unwritten extents) |
| 79 | +must be avoided when performing atomic writes. |
| 80 | + |
| 81 | +The reason is that an atomic write, when issued via pwritev2() with the RWF_ATOMIC |
| 82 | +flag, requires that either all data is written or none at all. In the event of |
| 83 | +a system crash or unexpected power loss during the write operation, the affected |
| 84 | +region (when later read) must reflect either the complete old data or the |
| 85 | +complete new data, but never a mix of both. |
| 86 | + |
| 87 | +To enforce this guarantee, we ensure that the write target is backed by |
| 88 | +a single, contiguous extent before any data is written. This is critical because |
| 89 | +ext4 defers the conversion of unwritten extents to written extents until the I/O |
| 90 | +completion path (typically in ->end_io()). If a write is allowed to proceed over |
| 91 | +a mixed mapping region (with mapped and unwritten extents) and a failure occurs |
| 92 | +mid-write, the system could observe partially updated regions after reboot, i.e. |
| 93 | +new data over mapped areas, and stale (old) data over unwritten extents that |
| 94 | +were never marked written. This violates the atomicity and/or torn write |
| 95 | +prevention guarantee. |
| 96 | + |
| 97 | +To prevent such torn writes, ext4 proactively allocates a single contiguous |
| 98 | +extent for the entire requested region in ``ext4_iomap_alloc()`` via |
| 99 | +``ext4_map_blocks_atomic()``. EXT4 also force-commits the current journal |
| 100 | +transaction if the allocation is done over a mixed mapping. This ensures any |
| 101 | +pending metadata updates (like unwritten-to-written extent conversion) in this |
| 102 | +range are in a consistent state with the file data blocks before performing the |
| 103 | +actual write I/O. If the commit fails, the whole I/O must be aborted to prevent |
| 104 | +any possible torn writes. |
| 105 | +Only after this step is the actual data write operation performed by iomap. |
| 106 | + |
| 107 | +Handling Split Extents Across Leaf Blocks |
| 108 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 109 | + |
| 110 | +There can be a special edge case where we have logically and physically |
| 111 | +contiguous extents stored in separate leaf nodes of the on-disk extent tree. |
| 112 | +This occurs because on-disk extent tree merges only happen within a leaf |
| 113 | +block, except for the case of a 2-level tree, which can get merged and |
| 114 | +collapsed entirely into the inode. |
| 115 | +If such a layout exists and, in the worst case, the extent status cache entries |
| 116 | +are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return |
| 117 | +a single contiguous extent for these split leaf extents. |
| 118 | + |
| 119 | +To address this edge case, a new get block flag, |
| 120 | +``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the |
| 121 | +``ext4_map_query_blocks()`` lookup behavior. |
| 122 | + |
| 123 | +This new get block flag allows ``ext4_map_blocks()`` to first check if there is |
| 124 | +an entry in the extent status cache for the full range. |
| 125 | +If not present, it consults the on-disk extent tree using |
| 126 | +``ext4_map_query_blocks()``. |
| 127 | +If the located extent is at the end of a leaf node, it probes the next logical |
| 128 | +block (lblk) to detect a contiguous extent in the adjacent leaf. |
| 129 | + |
| 130 | +For now only one additional leaf block is queried to maintain efficiency, as |
| 131 | +atomic writes are typically constrained to small sizes |
| 132 | +(e.g. [blocksize, clustersize]). |
| 133 | + |
| 134 | + |
| 135 | +Handling Journal Transactions |
| 136 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 137 | + |
| 138 | +To support multi-fsblock atomic writes, we ensure enough journal credits are |
| 139 | +reserved during: |
| 140 | + |
| 141 | + 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there |
| 142 | + could be a mixed mapping for the underlying requested range. If yes, then we |
| 143 | + reserve credits of up to ``m_len``, assuming every alternate block can be |
| 144 | + an unwritten extent followed by a hole. |
| 145 | + |
| 146 | + 2. During the ``->end_io()`` call, we make sure a single transaction is started |
| 147 | + for doing unwritten-to-written conversion. The loop for conversion is mainly |
| 148 | + required to handle a split extent across leaf blocks. |
| 149 | + |
| 150 | +How to |
| 151 | +------ |
| 152 | + |
| 153 | +Creating Filesystems with Atomic Write Support |
| 154 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 155 | + |
| 156 | +First check the atomic write units supported by the block device. |
| 157 | +See :ref:`atomic_write_bdev_support` for more details. |
| 158 | + |
| 159 | +For single-fsblock atomic writes with a larger block size |
| 160 | +(on systems with block size < page size): |
| 161 | + |
| 162 | +.. code-block:: bash |
| 163 | + |
| 164 | + # Create an ext4 filesystem with a 16KB block size |
| 165 | + # (requires page size >= 16KB) |
| 166 | + mkfs.ext4 -b 16384 /dev/device |
| 167 | + |
| 168 | +For multi-fsblock atomic writes with bigalloc: |
| 169 | + |
| 170 | +.. code-block:: bash |
| 171 | + |
| 172 | + # Create an ext4 filesystem with bigalloc and 64KB cluster size |
| 173 | + mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device |
| 174 | + |
| 175 | +Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes, |
| 176 | +and ``-O bigalloc`` enables the bigalloc feature. |
| 177 | + |
| 178 | +Application Interface |
| 179 | +~~~~~~~~~~~~~~~~~~~~~ |
| 180 | + |
| 181 | +Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag |
| 182 | +to perform atomic writes: |
| 183 | + |
| 184 | +.. code-block:: c |
| 185 | + |
| 186 | + pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC); |
| 187 | + |
| 188 | +The write must be aligned to the filesystem's block size and not exceed the |
| 189 | +filesystem's maximum atomic write unit size. |
| 190 | +See ``generic_atomic_write_valid()`` for more details. |
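A minimal userspace sketch of such a write follows. It is not taken from the patch: ``atomic_write_args_ok()`` and ``atomic_pwrite()`` are hypothetical helper names. The pre-check assumes that the length must be a power of two within the advertised unit range and that the offset must be a multiple of the length, which is believed to match what ``generic_atomic_write_valid()`` enforces. The fallback value of ``RWF_ATOMIC`` is the one from ``include/uapi/linux/fs.h`` and is only used when older userspace headers lack the definition.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040	/* fallback for older userspace headers */
#endif

/*
 * Hypothetical pre-check before issuing the write: the length must be a
 * power of two within [unit_min, unit_max] and the offset must be a
 * multiple of the length. unit_min/unit_max are expected to come from
 * statx().
 */
static bool atomic_write_args_ok(off_t offset, size_t len,
				 size_t unit_min, size_t unit_max)
{
	assert(unit_min <= unit_max);

	if (len == 0 || (len & (len - 1)) != 0)
		return false;
	if (len < unit_min || len > unit_max)
		return false;
	return offset >= 0 && ((uint64_t)offset % len) == 0;
}

/*
 * Issue one atomic write from a single buffer. The fd must have been
 * opened with O_DIRECT, and buf must meet the usual direct I/O memory
 * alignment. Returns the number of bytes written, or -1 with errno set
 * by pwritev2().
 */
static ssize_t atomic_pwrite(int fd, void *buf, size_t len, off_t offset)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);
}
```

A caller would typically run the pre-check first and fall back to a regular write path, or report an error, when it fails, rather than relying on the kernel to reject the request.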
| 191 | + |
| 192 | +The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the |
| 193 | +following details: |
| 194 | + |
| 195 | + * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request. |
| 196 | + * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request. |
| 197 | + * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of |
| 198 | + separate memory buffers that can be gathered into a write operation |
| 199 | + (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one. |
| 200 | + |
| 201 | +The ``STATX_ATTR_WRITE_ATOMIC`` flag in ``statx->attributes`` is set if atomic |
| 202 | +writes are supported. |
| 203 | + |
| 204 | +.. _atomic_write_bdev_support: |
| 205 | + |
| 206 | +Hardware Support |
| 207 | +---------------- |
| 208 | + |
| 209 | +The underlying storage device must support atomic write operations. |
| 210 | +Modern NVMe and SCSI devices often provide this capability. |
| 211 | +The Linux kernel exposes this information through sysfs: |
| 212 | + |
| 213 | +* ``/sys/block/<device>/queue/atomic_write_unit_min_bytes`` - Minimum atomic write size |
| 214 | +* ``/sys/block/<device>/queue/atomic_write_unit_max_bytes`` - Maximum atomic write size |
| 215 | + |
| 216 | +Nonzero values for these attributes indicate that the device supports |
| 217 | +atomic writes. |
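The check described above could be done from C roughly as follows. This is a sketch, not an existing API: ``read_queue_attr()`` and ``bdev_supports_atomic_writes()`` are hypothetical helpers. The attribute file names used in the second function are an assumption based on the block layer ABI documentation (``atomic_write_unit_min_bytes`` and ``atomic_write_unit_max_bytes``) and may need adjusting for a given kernel.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one queue attribute of a block device such as "nvme0n1".
 * Returns 0 and stores the value on success, -1 if the attribute is
 * missing or unreadable (for example on kernels without atomic write
 * support).
 */
static int read_queue_attr(const char *dev, const char *attr,
			   unsigned long *val)
{
	char path[256];
	FILE *f;
	int n, ok;

	assert(dev && attr && val);

	n = snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%lu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Nonzero minimum and maximum units mean the device advertises atomic
 * writes. The attribute names below are assumptions; see the lead-in.
 */
static int bdev_supports_atomic_writes(const char *dev)
{
	unsigned long min = 0, max = 0;

	if (read_queue_attr(dev, "atomic_write_unit_min_bytes", &min) ||
	    read_queue_attr(dev, "atomic_write_unit_max_bytes", &max))
		return 0;
	return min != 0 && max != 0;
}
```

Treating a missing attribute as "not supported" keeps the helper usable on older kernels that do not expose these files.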
| 218 | + |
| 219 | +See Also |
| 220 | +-------- |
| 221 | + |
| 222 | +* :doc:`bigalloc` - Documentation on the bigalloc feature |
| 223 | +* :doc:`allocators` - Documentation on block allocation in ext4 |
| 224 | +* Support for atomic block writes in 6.13: |
| 225 | + https://lwn.net/Articles/1009298/ |