
Commit d87d738

Merge tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
 "New ext4 features and performance improvements:

   - Fast commit performance improvements
   - Multi-fsblock atomic write support for bigalloc file systems
   - Large folio support for regular files

  This last can result in really stupendous performance for the right
  workloads. For example, see [1] where the Kernel Test Robot reported
  over 37% improvement on a large sequential I/O workload.

  There are also the usual bug fixes and cleanups. Of note are cleanups
  of the extent status tree to fix potential races that could result in
  the extent status tree getting corrupted under heavy simultaneous
  allocation and deallocation to a single file"

Link: https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/ [1]

* tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits)
  ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead
  ext4: Simplify flags in ext4_map_query_blocks()
  ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER
  ext4: Simplify last in leaf check in ext4_map_query_blocks
  ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE
  ext4: only dirty folios when data journaling regular files
  ext4: Add atomic block write documentation
  ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc
  ext4: Add multi-fsblock atomic write support with bigalloc
  ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS
  ext4: Make ext4_meta_trans_blocks() non-static for later use
  ext4: Check if inode uses extents in ext4_inode_can_atomic_write()
  ext4: Document an edge case for overwrites
  jbd2: remove journal_t argument from jbd2_superblock_csum()
  jbd2: remove journal_t argument from jbd2_chksum()
  ext4: remove sb argument from ext4_superblock_csum()
  ext4: remove sbi argument from ext4_chksum()
  ext4: enable large folio for regular file
  ext4: make online defragmentation support large folios
  ext4: make the writeback path support large folios
  ...
2 parents e9d7126 + 7acd1b3 commit d87d738

27 files changed

+1290 -474 lines changed
Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
1+
.. SPDX-License-Identifier: GPL-2.0
2+
.. _atomic_writes:
3+
4+
Atomic Block Writes
5+
-------------------------
6+
7+
Introduction
8+
~~~~~~~~~~~~
9+
10+
Atomic (untorn) block writes ensure that either the entire write is committed
11+
to disk or none of it is. This prevents "torn writes" during power loss or
12+
system crashes. The ext4 filesystem supports atomic writes (only with Direct
13+
I/O) on regular files with extents, provided the underlying storage device
14+
supports hardware atomic writes. This is supported in the following two ways:
15+
16+
1. **Single-fsblock Atomic Writes**:
17+
EXT4 has supported atomic write operations on a single filesystem block since
18+
v6.13. In this mode, the atomic write unit minimum and maximum sizes are both
19+
set to the filesystem block size.
20+
For example, a 16KB atomic write is possible with a 16KB filesystem block size
21+
on a system with a 64KB page size.
22+
23+
2. **Multi-fsblock Atomic Writes with Bigalloc**:
24+
EXT4 now also supports atomic writes spanning multiple filesystem blocks
25+
using a feature known as bigalloc. The atomic write unit's minimum and
26+
maximum sizes are determined by the filesystem block size and cluster size,
27+
based on the underlying device’s supported atomic write unit limits.
28+
29+
Requirements
30+
~~~~~~~~~~~~
31+
32+
Basic requirements for atomic writes in ext4:
33+
34+
1. The extents feature must be enabled (default for ext4)
35+
2. The underlying block device must support atomic writes
36+
3. For single-fsblock atomic writes:
37+
38+
1. A filesystem with an appropriate block size (up to the page size)
39+
4. For multi-fsblock atomic writes:
40+
41+
1. The bigalloc feature must be enabled
42+
2. The cluster size must be appropriately configured
43+
44+
NOTE: EXT4 does not support software or COW based atomic writes, which means
45+
atomic writes on ext4 are only supported if the underlying storage device
46+
supports them.
47+
48+
Multi-fsblock Implementation Details
49+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
50+
51+
The bigalloc feature changes ext4 to allocate in units of multiple filesystem
52+
blocks, also known as clusters. With bigalloc, each bit within the block bitmap
53+
represents a cluster (a power-of-2 number of blocks) rather than an individual
54+
filesystem block.
55+
EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
56+
following constraints. The minimum atomic write size is the larger of the fs
57+
block size and the minimum hardware atomic write unit; and the maximum atomic
58+
write size is the smaller of the bigalloc cluster size and the maximum hardware
59+
atomic write unit. Bigalloc ensures that all allocations are aligned to the
60+
cluster size, which satisfies the LBA alignment requirements of the hardware
61+
device if the start of the partition/logical volume is itself aligned correctly.
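
The size limits described in this paragraph amount to a simple clamp. The following is a minimal sketch in C under stated assumptions, not the kernel's implementation: the names ``atomic_limits`` and ``atomic_limits_sketch`` are hypothetical, all sizes are in bytes, and treating an empty range (minimum above maximum) as "unsupported" is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the resulting limits, in bytes. */
struct atomic_limits {
	uint32_t unit_min;
	uint32_t unit_max;
	bool supported;
};

/* Hypothetical helper: derive atomic write limits for a bigalloc filesystem. */
struct atomic_limits atomic_limits_sketch(uint32_t fs_block_size,
					  uint32_t cluster_size,
					  uint32_t hw_unit_min,
					  uint32_t hw_unit_max)
{
	struct atomic_limits lim = { 0, 0, false };

	/* Zero hardware limits mean the device has no atomic write support. */
	if (hw_unit_min == 0 || hw_unit_max == 0)
		return lim;

	/* Minimum: the larger of the fs block size and the hardware minimum. */
	lim.unit_min = fs_block_size > hw_unit_min ? fs_block_size : hw_unit_min;
	/* Maximum: the smaller of the cluster size and the hardware maximum. */
	lim.unit_max = cluster_size < hw_unit_max ? cluster_size : hw_unit_max;

	/* Assumption: an empty range means atomic writes cannot be offered. */
	lim.supported = lim.unit_min <= lim.unit_max;
	if (!lim.supported)
		lim.unit_min = lim.unit_max = 0;
	return lim;
}
```

For instance, a 4KB block size with a 64KB cluster on a device advertising 512B to 16KB atomic units would yield a range of 4KB to 16KB under this sketch.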
62+
63+
Here is the block allocation strategy in bigalloc for atomic writes:
64+
65+
* For regions with fully mapped extents, no additional work is needed
66+
* For append writes, a new mapped extent is allocated
67+
* For regions that are entirely holes, an unwritten extent is created
68+
* For large unwritten extents, the extent gets split into two unwritten
69+
extents of appropriate requested size
70+
* For mixed mapping regions (combinations of holes, unwritten extents, or
71+
mapped extents), ext4_map_blocks() is called in a loop with
72+
EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
73+
mapped extent by writing zeroes to it and converting any unwritten extents to
74+
written, if found within the range.
75+
76+
Note: Writing on a single contiguous underlying extent, whether mapped or
77+
unwritten, is not inherently problematic. However, writing to a mixed mapping
78+
region (i.e. one containing a combination of mapped and unwritten extents)
79+
must be avoided when performing atomic writes.
80+
81+
The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
82+
flag, require that either all data is written or none at all. In the event of
83+
a system crash or unexpected power loss during the write operation, the affected
84+
region (when later read) must reflect either the complete old data or the
85+
complete new data, but never a mix of both.
86+
87+
To enforce this guarantee, we ensure that the write target is backed by
88+
a single, contiguous extent before any data is written. This is critical because
89+
ext4 defers the conversion of unwritten extents to written extents until the I/O
90+
completion path (typically in ->end_io()). If a write is allowed to proceed over
91+
a mixed mapping region (with mapped and unwritten extents) and a failure occurs
92+
mid-write, the system could observe partially updated regions after reboot, i.e.
93+
new data over mapped areas, and stale (old) data over unwritten extents that
94+
were never marked written. This violates the atomicity and/or torn write
95+
prevention guarantee.
96+
97+
To prevent such torn writes, ext4 proactively allocates a single contiguous
98+
extent for the entire requested region in ``ext4_iomap_alloc`` via
99+
``ext4_map_blocks_atomic()``. EXT4 also force-commits the current journalling
100+
transaction if the allocation is done over a mixed mapping. This ensures any
101+
pending metadata updates (like unwritten to written extents conversion) in this
102+
range are in a consistent state with the file data blocks, before performing the
103+
actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
104+
any possible torn writes.
105+
Only after this step is the actual data write operation performed by iomap.
106+
107+
Handling Split Extents Across Leaf Blocks
108+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
109+
110+
There can be a special edge case where we have logically and physically
111+
contiguous extents stored in separate leaf nodes of the on-disk extent tree.
112+
This occurs because on-disk extent tree merges only happen within leaf blocks,
113+
except for the case of a 2-level tree, which can get merged and collapsed
114+
entirely into the inode.
115+
If such a layout exists and, in the worst case, the extent status cache entries
116+
are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
117+
a single contiguous extent for these split leaf extents.
118+
119+
To address this edge case, a new get block flag
120+
``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS`` is added to enhance the
121+
``ext4_map_query_blocks()`` lookup behavior.
122+
123+
This new get block flag allows ``ext4_map_blocks()`` to first check if there is
124+
an entry in the extent status cache for the full range.
125+
If not present, it consults the on-disk extent tree using
126+
``ext4_map_query_blocks()``.
127+
If the located extent is at the end of a leaf node, it probes the next logical
128+
block (lblk) to detect a contiguous extent in the adjacent leaf.
129+
130+
For now only one additional leaf block is queried to maintain efficiency, as
131+
atomic writes are typically constrained to small sizes
132+
(e.g. [blocksize, clustersize]).
133+
134+
135+
Handling Journal transactions
136+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
137+
138+
To support multi-fsblock atomic writes, we ensure enough journal credits are
139+
reserved during:
140+
141+
1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
142+
could be a mixed mapping for the underlying requested range. If yes, then we
143+
reserve credits of up to ``m_len``, assuming every alternate block can be
144+
an unwritten extent followed by a hole.
145+
146+
2. During the ``->end_io()`` call, we make sure a single transaction is started
147+
for doing unwritten-to-written conversion. The conversion loop is mainly
148+
required to handle a split extent across leaf blocks.
149+
150+
How to
151+
------
152+
153+
Creating Filesystems with Atomic Write Support
154+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
155+
156+
First check the atomic write units supported by the block device.
157+
See :ref:`atomic_write_bdev_support` for more details.
158+
159+
For single-fsblock atomic writes with a larger block size
160+
(on systems with block size < page size):
161+
162+
.. code-block:: bash
163+
164+
# Create an ext4 filesystem with a 16KB block size
165+
# (requires page size >= 16KB)
166+
mkfs.ext4 -b 16384 /dev/device
167+
168+
For multi-fsblock atomic writes with bigalloc:
169+
170+
.. code-block:: bash
171+
172+
# Create an ext4 filesystem with bigalloc and 64KB cluster size
173+
mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
174+
175+
Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
176+
and ``-O bigalloc`` enables the bigalloc feature.
177+
178+
Application Interface
179+
~~~~~~~~~~~~~~~~~~~~~
180+
181+
Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
182+
to perform atomic writes:
183+
184+
.. code-block:: c
185+
186+
pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
187+
188+
The write must be aligned to the filesystem's block size and not exceed the
189+
filesystem's maximum atomic write unit size.
190+
See ``generic_atomic_write_valid()`` for more details.
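
As a rough illustration of these constraints, the user-space sketch below pre-checks a request before it is handed to ``pwritev2()`` with ``RWF_ATOMIC``. It is an approximation under stated assumptions, not the kernel's ``generic_atomic_write_valid()``: the helper name ``atomic_write_request_ok`` is hypothetical, the power-of-two length and natural-alignment conditions are taken from the general ``RWF_ATOMIC`` rules rather than from this document, and ``unit_min``/``unit_max`` are assumed to have been obtained via ``statx()``.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if a write of len bytes at offset
 * looks acceptable as an atomic write for the given unit limits.
 */
bool atomic_write_request_ok(uint64_t offset, uint64_t len,
			     uint32_t unit_min, uint32_t unit_max)
{
	/* Zero limits mean the file does not support atomic writes. */
	if (unit_min == 0 || unit_max == 0)
		return false;
	/* The total length must lie within the advertised unit range. */
	if (len < unit_min || len > unit_max)
		return false;
	/* Assumption: the total length must be a power of two. */
	if ((len & (len - 1)) != 0)
		return false;
	/* Assumption: the offset must be naturally aligned to the length. */
	return (offset & (len - 1)) == 0;
}
```

A request that passes this check would then be issued through a single ``iovec`` on a file opened for Direct I/O; the kernel remains the final authority and may still reject the write.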
191+
192+
The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the following
193+
details:
194+
195+
* ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
196+
* ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
197+
* ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
198+
separate memory buffers that can be gathered into a write operation
199+
(e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
200+
201+
The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
202+
writes are supported.
203+
204+
.. _atomic_write_bdev_support:
205+
206+
Hardware Support
207+
----------------
208+
209+
The underlying storage device must support atomic write operations.
210+
Modern NVMe and SCSI devices often provide this capability.
211+
The Linux kernel exposes this information through sysfs:
212+
213+
* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
214+
* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
215+
216+
Nonzero values for these attributes indicate that the device supports
217+
atomic writes.
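
A program can perform the same check by reading the two attributes. The sketch below is illustrative only: the function names ``read_queue_limit`` and ``device_supports_atomic_writes`` are hypothetical, and the attribute paths are passed in by the caller rather than hard-coded, since the exact file names should be confirmed against the sysfs block ABI documentation for the running kernel.

```c
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_queue_limit(const char *path, unsigned long long *value)
{
	FILE *f = fopen(path, "r");
	int parsed;

	if (!f)
		return -1;
	parsed = (fscanf(f, "%llu", value) == 1);
	fclose(f);
	return parsed ? 0 : -1;
}

/*
 * Returns 1 if both limits are nonzero, 0 if either is zero,
 * and -1 if either attribute could not be read.
 */
int device_supports_atomic_writes(const char *min_path, const char *max_path)
{
	unsigned long long unit_min = 0, unit_max = 0;

	if (read_queue_limit(min_path, &unit_min) != 0 ||
	    read_queue_limit(max_path, &unit_max) != 0)
		return -1;
	return (unit_min != 0 && unit_max != 0) ? 1 : 0;
}
```

Returning -1 separately from 0 lets a caller distinguish "attribute missing on this kernel" from "device reports no atomic write support".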
218+
219+
See Also
220+
--------
221+
222+
* :doc:`bigalloc` - Documentation on the bigalloc feature
223+
* :doc:`allocators` - Documentation on block allocation in ext4
224+
* Support for atomic block writes in 6.13:
225+
https://lwn.net/Articles/1009298/

Documentation/filesystems/ext4/overview.rst

Lines changed: 1 addition & 0 deletions
@@ -25,3 +25,4 @@ order.
 .. include:: inlinedata.rst
 .. include:: eainode.rst
 .. include:: verity.rst
+.. include:: atomic_writes.rst

fs/ext4/bitmap.c

Lines changed: 4 additions & 4 deletions
@@ -30,7 +30,7 @@ int ext4_inode_bitmap_csum_verify(struct super_block *sb,
 
 	sz = EXT4_INODES_PER_GROUP(sb) >> 3;
 	provided = le16_to_cpu(gdp->bg_inode_bitmap_csum_lo);
-	calculated = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
+	calculated = ext4_chksum(sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
 	if (sbi->s_desc_size >= EXT4_BG_INODE_BITMAP_CSUM_HI_END) {
 		hi = le16_to_cpu(gdp->bg_inode_bitmap_csum_hi);
 		provided |= (hi << 16);
@@ -52,7 +52,7 @@ void ext4_inode_bitmap_csum_set(struct super_block *sb,
 		return;
 
 	sz = EXT4_INODES_PER_GROUP(sb) >> 3;
-	csum = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
+	csum = ext4_chksum(sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
 	gdp->bg_inode_bitmap_csum_lo = cpu_to_le16(csum & 0xFFFF);
 	if (sbi->s_desc_size >= EXT4_BG_INODE_BITMAP_CSUM_HI_END)
 		gdp->bg_inode_bitmap_csum_hi = cpu_to_le16(csum >> 16);
@@ -71,7 +71,7 @@ int ext4_block_bitmap_csum_verify(struct super_block *sb,
 		return 1;
 
 	provided = le16_to_cpu(gdp->bg_block_bitmap_csum_lo);
-	calculated = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
+	calculated = ext4_chksum(sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
 	if (sbi->s_desc_size >= EXT4_BG_BLOCK_BITMAP_CSUM_HI_END) {
 		hi = le16_to_cpu(gdp->bg_block_bitmap_csum_hi);
 		provided |= (hi << 16);
@@ -92,7 +92,7 @@ void ext4_block_bitmap_csum_set(struct super_block *sb,
 	if (!ext4_has_feature_metadata_csum(sb))
 		return;
 
-	csum = ext4_chksum(sbi, sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
+	csum = ext4_chksum(sbi->s_csum_seed, (__u8 *)bh->b_data, sz);
 	gdp->bg_block_bitmap_csum_lo = cpu_to_le16(csum & 0xFFFF);
 	if (sbi->s_desc_size >= EXT4_BG_BLOCK_BITMAP_CSUM_HI_END)
 		gdp->bg_block_bitmap_csum_hi = cpu_to_le16(csum >> 16);
