We recently received a report of poor performance doing sequential
buffered reads of a file with compressed extents. With bs=128k, a naive
sequential dd ran about as fast on a compressed file as on an
uncompressed one (1.2GB/s on my reproducing system), while with bs<32k
the performance tanked down to ~300MB/s.
i.e.,
slow:
dd if=some-compressed-file of=/dev/null bs=4k count=X
vs
fast:
dd if=some-compressed-file of=/dev/null bs=128k count=Y
The cause of this slowness is the overhead of looking up extent_maps
to enable readahead pre-caching on compressed extents
(add_ra_bio_pages()), as well as some overhead in the generic VFS
readahead code that we hit more often in the slow case. Notably, the
main difference between the two read sizes is that with large requests
we call btrfs_readahead() relatively rarely, while with small requests
we call it for every compressed extent. So the fast case stays in the
btrfs readahead loop:
    while ((folio = readahead_folio(rac)) != NULL)
            btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
whereas the slow case breaks out of that loop every time. This results
in calling add_ra_bio_pages() a lot, doing lots of extent_map lookups,
extent_map locking, etc.
This happens because although add_ra_bio_pages() does add the
appropriate uncompressed file pages to the page cache, it does not
communicate back to the ractl in any way. To solve this, we should be
using readahead_expand() to signal to the readahead code that it should
expand the readahead window.
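For context, this is the generic VFS helper in question (declared in
include/linux/pagemap.h, implemented in mm/readahead.c); it grows the
readahead window described by the ractl so that it covers at least the
requested byte range:

    /*
     * Attempt to expand the current readahead request so that it covers
     * at least [new_start, new_start + new_len), allocating folios and
     * adding them to the page cache as needed.
     */
    void readahead_expand(struct readahead_control *ractl,
                          loff_t new_start, size_t new_len);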
This change passes the readahead_control into the btrfs_bio_ctrl and,
in the case of compressed reads, sets the expansion to the size of the
extent_map we already looked up anyway. It skips the subpage case, as
that one already doesn't do add_ra_bio_pages().
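A minimal sketch of the shape of that change, for illustration only
(the ractl member and the helper below are assumptions, not the exact
patch):

    /* Assumed: bio_ctrl carries the readahead_control driving this read. */
    struct btrfs_bio_ctrl {
            /* ... existing members ... */
            struct readahead_control *ractl;    /* NULL when not in readahead */
    };

    /*
     * Hypothetical helper: on a compressed read, grow the readahead window
     * to cover the extent_map we already looked up, so the VFS keeps
     * handing us folios instead of re-entering btrfs_readahead() (and thus
     * add_ra_bio_pages()) for every compressed extent. The subpage case
     * (not shown here) would be skipped, since it does not use
     * add_ra_bio_pages() anyway.
     */
    static void btrfs_expand_readahead(struct btrfs_bio_ctrl *bio_ctrl,
                                       const struct extent_map *em)
    {
            if (!bio_ctrl->ractl)
                    return;

            readahead_expand(bio_ctrl->ractl, em->start, em->len);
    }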
With this change, whether we use bs=4k or bs=128k, btrfs expands the
readahead window up to the largest compressed extent we have seen so far
(in the trivial example: 128k), and the call stacks of the two modes
look identical. Notably, we barely call add_ra_bio_pages() at all, and
the performance becomes identical as well. So this change certainly
"fixes" this performance problem.
Of course, it does raise a few questions:
1. Will this waste too much page cache with a too large ra window?
2. Will this somehow cause bugs prevented by the more thoughtful
checking in add_ra_bio_pages?
3. Should we delete add_ra_bio_pages?
My stabs at some answers:
1. Hard to say. See attempts at generic performance testing below. Is
there a "readahead_shrink" we should be using? Should we expand more
slowly, by half the remaining em size each time?
2. I don't think so. Since the new behavior is indistinguishable from
reading the file with a larger read size passed in, I don't see why
one would be safe but not the other.
3. Probably! I tested that and it was fine in fstests, and it seems like
the pages would get re-used just as well in the readahead case.
However, it is possible that some reads that use the page cache but not
btrfs_readahead() could suffer. I will investigate this further as a
follow-up.
I tested the performance implications of this change in 3 ways (using
compress-force=zstd:3 for compression):
1. Directly test the affected workload of small sequential reads on a
   compressed file (improved from ~250MB/s to ~1.2GB/s):
==========for-next==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s
dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s
dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s
dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s
==========ra-expand==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s
dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s
dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s
dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s
2. Built the linux kernel from clean (no change).
3. Ran fsperf. Mostly neutral results with some improvements and
   regressions here and there.
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>