Commit 4a4b30e

Merge tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs
Pull bcachefs updates from Kent Overstreet:

 "On disk format is now soft frozen: no more required/automatic are
  anticipated before taking off the experimental label.

  Major changes/features since 6.14:

   - Scrub

   - Blocksize greater than page size support

   - A number of "rebalance spinning and doing no work" issues have
     been fixed; we now check if the write allocation will succeed in
     bch2_data_update_init(), before kicking off the read.

     There's still more work to do in this area. Later we may want to
     add another bitset btree, like rebalance_work, to track "extents
     that rebalance was requested to move but couldn't", e.g. due to
     destination target having insufficient online devices.

   - We can now support scaling well into the petabyte range: latest
     bcachefs-tools will pick an appropriate bucket size at format time
     to ensure fsck can run in available memory (e.g. a server with
     256GB of ram and 100PB of storage would want 16MB buckets).

  On disk format changes:

   - 1.21: cached backpointers (scalability improvement)

     Cached replicas now get backpointers, which means we no longer
     rely on incrementing bucket generation numbers to invalidate
     cached data: this lets us get rid of the bucket generation number
     garbage collection, which had to periodically rescan all extents
     to recompute bucket oldest_gen.

     Bucket generation numbers are now only used as a consistency
     check, but they're quite useful for that.

   - 1.22: stripe backpointers

     Stripes now have backpointers: erasure coded stripes have their
     own checksums, separate from the checksums for the extents they
     contain (and stripe checksums also cover the parity blocks). This
     is required for implementing scrub for stripes.

   - 1.23: stripe lru (scalability improvement)

     Persistent lru for stripes, ordered by "number of empty blocks".
     This is used by the stripe creation path, which depending on free
     space may create a new stripe out of a partially empty existing
     stripe instead of starting a brand new stripe.

     This replaces an in-memory heap, and means we no longer have to
     read in the stripes btree at startup.

   - 1.24: casefolding

     Case insensitive directory support, courtesy of Valve.

     This is an incompatible feature, to enable mount with
       -o version_upgrade=incompatible

   - 1.25: extent_flags

     Another incompatible feature requiring explicit opt-in to enable.

     This adds a flags entry to extents, and a flag bit that marks
     extents as poisoned.

     A poisoned extent is an extent that was unreadable due to checksum
     errors. We can't move such extents without giving them a new
     checksum, and we may have to move them (for e.g. copygc or device
     evacuate). We also don't want to delete them: in the future we'll
     have an API that lets userspace ignore checksum errors and attempt
     to deal with simple bitrot itself. Marking them as poisoned lets
     us continue to return the correct error to userspace on normal
     read calls.

  Other changes/features:

   - BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs
     top' command, which shows a live view of all internal filesystem
     counters.

   - Improved journal pipelining: we can now have 16 journal writes in
     flight concurrently, up from 4. We're logging significantly more
     to the journal than we used to with all the recent disk accounting
     changes and additions, so some users should see a performance
     increase on some workloads.

   - BCH_MEMBER_STATE_failed: previously, we would do no IO at all to
     devices marked as failed. Now we will attempt to read from them,
     but only if we have no better options.

   - New option, write_error_timeout: devices will be kicked out of the
     filesystem if all writes have been failing for x number of
     seconds.

     We now also kick devices out when notified by blk_holder_ops that
     they've gone offline.

   - Device option handling improvements: the discard option should now
     be working as expected (additionally, in -tools, all device
     options that can be set at format time can now be set at device
     add time, i.e. data_allowed, state).

   - We now try harder to read data after a checksum error: we'll do
     additional retries if necessary to a device after it gave us data
     with a checksum error.

   - More self healing work: the full inode <-> dirent consistency
     checks that are currently run by fsck are now also run every time
     we do a lookup, meaning we'll be able to correct errors at
     runtime. Runtime self healing will be flipped on after the new
     changes have seen more testing, currently they're just checking
     for consistency.

   - KMSAN fixes: our KMSAN builds should be nearly clean now, which
     will put a massive dent in the syzbot dashboard"

* tag 'bcachefs-2025-03-24' of git://evilpiepirate.org/bcachefs: (180 commits)
  bcachefs: Kill unnecessary bch2_dev_usage_read()
  bcachefs: btree node write errors now print btree node
  bcachefs: Fix race in print_chain()
  bcachefs: btree_trans_restart_foreign_task()
  bcachefs: bch2_disk_accounting_mod2()
  bcachefs: zero init journal bios
  bcachefs: Eliminate padding in move_bucket_key
  bcachefs: Fix a KMSAN splat in btree_update_nodes_written()
  bcachefs: kmsan asserts
  bcachefs: Fix kmsan warnings in bch2_extent_crc_pack()
  bcachefs: Disable asm memcpys when kmsan enabled
  bcachefs: Handle backpointers with unknown data types
  bcachefs: Count BCH_DATA_parity backpointers correctly
  bcachefs: Run bch2_check_dirent_target() at lookup time
  bcachefs: Refactor bch2_check_dirent_target()
  bcachefs: Move bch2_check_dirent_target() to namei.c
  bcachefs: fs-common.c -> namei.c
  bcachefs: EIO cleanup
  bcachefs: bch2_write_prep_encoded_data() now returns errcode
  bcachefs: Simplify bch2_write_op_error()
  ...

2 parents f79adee + d8bdc8d commit 4a4b30e
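
As a rough check on the bucket-size example in the pull message (256GB of RAM, 100PB of storage, 16MB buckets), the sketch below computes how many buckets such a filesystem would have and how much RAM that leaves per bucket. It assumes binary units and is plain arithmetic: the function name is hypothetical, and nothing here reflects how bcachefs-tools actually picks a bucket size or how much memory fsck needs per bucket.

```python
# Back-of-the-envelope arithmetic only; not taken from bcachefs-tools.
# Assumption: GB/PB/MB in the pull message are treated as binary units.
MiB = 2**20
GiB = 2**30
PiB = 2**50


def bucket_budget(capacity_bytes, bucket_bytes, ram_bytes):
    """Return (number of buckets, bytes of RAM available per bucket)."""
    nr_buckets = capacity_bytes // bucket_bytes
    return nr_buckets, ram_bytes / nr_buckets


nr_buckets, ram_per_bucket = bucket_budget(100 * PiB, 16 * MiB, 256 * GiB)
print(f"{nr_buckets} buckets, {ram_per_bucket:.2f} bytes of RAM per bucket")
```

With these inputs there are about 6.7 billion buckets and roughly 41 bytes of RAM per bucket; halving the bucket size would halve that per-bucket budget.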

116 files changed, +4816 -2944 lines changed

Documentation/filesystems/bcachefs/SubmittingPatches.rst

Lines changed: 25 additions & 18 deletions
@@ -1,8 +1,13 @@
-Submitting patches to bcachefs:
-===============================
+Submitting patches to bcachefs
+==============================
+
+Here are suggestions for submitting patches to bcachefs subsystem.
+
+Submission checklist
+--------------------
 
 Patches must be tested before being submitted, either with the xfstests suite
-[0], or the full bcachefs test suite in ktest [1], depending on what's being
+[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
 touched. Note that ktest wraps xfstests and will be an easier method to running
 it for most users; it includes single-command wrappers for all the mainstream
 in-kernel local filesystems.
@@ -26,21 +31,21 @@ considered out of date), but try not to deviate too much without reason.
 Focus on writing code that reads well and is organized well; code should be
 aesthetically pleasing.
 
-CI:
-===
+CI
+--
 
 Instead of running your tests locally, when running the full test suite it's
 preferable to let a server farm do it in parallel, and then have the results
 in a nice test dashboard (which can tell you which failures are new, and
 presents results in a git log view, avoiding the need for most bisecting).
 
-That exists [2], and community members may request an account. If you work for
+That exists [2]_, and community members may request an account. If you work for
 a big tech company, you'll need to help out with server costs to get access -
 but the CI is not restricted to running bcachefs tests: it runs any ktest test
 (which generally makes it easy to wrap other tests that can run in qemu).
 
-Other things to think about:
-============================
+Other things to think about
+---------------------------
 
 - How will we debug this code? Is there sufficient introspection to diagnose
 when something starts acting wonky on a user machine?
@@ -79,20 +84,22 @@ Other things to think about:
 tested? (Automated tests exists but aren't in the CI, due to the hassle of
 disk image management; coordinate to have them run.)
 
-Mailing list, IRC:
-==================
+Mailing list, IRC
+-----------------
 
-Patches should hit the list [3], but much discussion and code review happens on
-IRC as well [4]; many people appreciate the more conversational approach and
-quicker feedback.
+Patches should hit the list [3]_, but much discussion and code review happens
+on IRC as well [4]_; many people appreciate the more conversational approach
+and quicker feedback.
 
 Additionally, we have a lively user community doing excellent QA work, which
 exists primarily on IRC. Please make use of that resource; user feedback is
 important for any nontrivial feature, and documenting it in commit messages
 would be a good idea.
 
-[0]: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
-[1]: https://evilpiepirate.org/git/ktest.git/
-[2]: https://evilpiepirate.org/~testdashboard/ci/
-[3]: linux-bcachefs@vger.kernel.org
-[4]: irc.oftc.net#bcache, #bcachefs-dev
+.. rubric:: References
+
+.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+.. [1] https://evilpiepirate.org/git/ktest.git/
+.. [2] https://evilpiepirate.org/~testdashboard/ci/
+.. [3] linux-bcachefs@vger.kernel.org
+.. [4] irc.oftc.net#bcache, #bcachefs-dev

Documentation/filesystems/bcachefs/casefolding.rst

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main usecase for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(eg. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with the `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
+casefold the query string.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+* Regular: [dirent data][regular name][nul][nul]...
+* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of NULs here is merely for illustration; their count can
+vary per-key, and they may not even be present if the key is aligned to
+`sizeof(u64)`.)
+
+This is efficient as it means that for all file lookups that require casefolding,
+it has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store without the two lengths, and just take the length of
+the regular name and casefolded name contiguously / 2 as the length. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free-bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes the name block as lengths.
+You can see this in the `d_cf_name_block` member of union in `bch_dirent`.
+
+The feature bit was used to allow casefolding support to be enabled for the majority
+of users, but some allow users who have no need for the feature to still use bcachefs as
+`CONFIG_UNICODE` can increase the kernel side a significant amount due to the tables used,
+which may be decider between using bcachefs for eg. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encodings than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
+dentry/dcache considerations
+----------------------------
+
+Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
+negative dentry's.
+
+This is because currently doing so presents a problem in the following scenario:
+
+- Lookup file "blAH" in a casefolded directory
+- Creation of file "BLAH" in a casefolded directory
+- Lookup file "blAH" in a casefolded directory
+
+This would fail if negative dentry's were cached.
+
+This is slightly suboptimal, but could be fixed in future with some vfs work.
+

Documentation/filesystems/bcachefs/index.rst

Lines changed: 19 additions & 1 deletion
@@ -4,10 +4,28 @@
 bcachefs Documentation
 ======================
 
+Subsystem-specific development process notes
+--------------------------------------------
+
+Development notes specific to bcachefs. These are intended to supplement
+:doc:`general kernel development handbook </process/index>`.
+
 .. toctree::
-:maxdepth: 2
+:maxdepth: 1
 :numbered:
 
 CodingStyle
 SubmittingPatches
+
+Filesystem implementation
+-------------------------
+
+Documentation for filesystem features and their implementation details.
+At this moment, only a few of these are described here.
+
+.. toctree::
+:maxdepth: 1
+:numbered:
+
+casefolding
 errorcodes

fs/bcachefs/Kconfig

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ config BCACHEFS_FS
 select ZSTD_COMPRESS
 select ZSTD_DECOMPRESS
 select CRYPTO
-select CRYPTO_SHA256
+select CRYPTO_LIB_SHA256
 select CRYPTO_CHACHA20
 select CRYPTO_POLY1305
 select KEYS

fs/bcachefs/Makefile

Lines changed: 2 additions & 1 deletion
@@ -41,7 +41,6 @@ bcachefs-y := \
 extent_update.o \
 eytzinger.o \
 fs.o \
-fs-common.o \
 fs-ioctl.o \
 fs-io.o \
 fs-io-buffered.o \
@@ -64,9 +63,11 @@ bcachefs-y := \
 migrate.o \
 move.o \
 movinggc.o \
+namei.o \
 nocow_locking.o \
 opts.o \
 printbuf.o \
+progress.o \
 quota.o \
 rebalance.o \
 rcu_pending.o \
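
The casefolded dirent layout described in the casefolding document added by this commit ([reg len][cf len][regular name][casefolded name][nul]...) can be illustrated with a small user-space sketch. This is not the kernel code. It assumes the two lengths are 16-bit little-endian values (the document only says the first 4 bytes of the name block are lengths), it pads the name block alone rather than the whole key, it uses Python's `str.casefold()` as a stand-in for the kernel's `utf8_casefold`, and all function names are hypothetical.

```python
import struct

U64_SIZE = 8  # keys are padded with NULs up to a multiple of sizeof(u64)


def pack_cf_name_block(name):
    """Build [reg len][cf len][regular name][casefolded name][nul]..."""
    regular = name.encode("utf-8")
    # Stand-in for utf8_casefold(); the kernel's folding rules may differ.
    folded = name.casefold().encode("utf-8")
    # Assumed encoding: two little-endian u16 lengths (4 bytes in total).
    block = struct.pack("<HH", len(regular), len(folded)) + regular + folded
    padding = -len(block) % U64_SIZE
    return block + b"\0" * padding


def unpack_cf_name_block(block):
    """Return (regular name, casefolded name) as bytes."""
    reg_len, cf_len = struct.unpack_from("<HH", block, 0)
    start = 4
    regular = block[start:start + reg_len]
    folded = block[start + reg_len:start + reg_len + cf_len]
    return regular, folded


def lookup_matches(block, query):
    """Casefold the query once, then do a plain byte comparison."""
    _, folded = unpack_cf_name_block(block)
    return folded == query.casefold().encode("utf-8")
```

Because the casefolded name is cached next to the regular one, matching a query costs one fold of the query string plus a byte comparison. Under Python's folding rules, the capital sharp s (U+1E9E) is 3 bytes in UTF-8 while its folded form "ss" is 2 bytes, which is the kind of length mismatch the rationale gives for storing both lengths.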
