
Commit 89cfa89

Dave Chinner (dchinner) authored and committed
xfs: reduce AGF hold times during fstrim operations
fstrim will hold the AGF lock for as long as it takes to walk and discard all the free space in the AG that meets the userspace trim criteria. For AGs with lots of free space extents (e.g. millions) or where the underlying device is really slow at processing discard requests (e.g. Ceph RBD), this means the AGF hold time is often measured in minutes to hours, not the few milliseconds we normally see with non-discard based operations.

This can result in the entire filesystem hanging whilst the long-running fstrim is in progress. We can have transactions get stuck waiting for the AGF lock (data or metadata extent allocation and freeing), and then more transactions get stuck waiting on the locks those transactions hold. We can get to the point where fstrim blocks an extent allocation or free operation long enough that it ends up pinning the tail of the log and the log then runs out of space. At this point, every modification in the filesystem gets blocked. This includes read operations, if atime updates need to be made.

To fix this problem, we need to be able to discard free space extents safely without holding the AGF lock. Fortunately, we already do this with online discard via busy extents. We can mark free space extents as "busy being discarded" under the AGF lock and then unlock the AGF, knowing that nobody will be able to allocate that free space extent until we remove it from the busy tree.

Modify xfs_trim_extents to use the same asynchronous discard mechanism backed by busy extents as is used with online discard. This results in the AGF only needing to be held for short periods of time, and it is never held while we issue discards. Hence if discard submission gets throttled because it is slow and/or there are lots of them, we aren't preventing other operations from being performed on the AGF while we wait for discards to complete...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
1 parent 428c443 commit 89cfa89
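For context, the "userspace trim criteria" referred to in the commit message (a start offset, a length and a minimum extent length) reach the kernel through the FITRIM ioctl that fstrim(8) issues; XFS handles it in xfs_ioc_trim(), which walks each AG with those parameters. A minimal, illustrative userspace sketch of that call follows - it is not part of this commit and the error handling is kept deliberately short:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */

int main(int argc, char **argv)
{
        /* Trim the whole filesystem, skipping free extents smaller than 1MiB. */
        struct fstrim_range range = {
                .start  = 0,
                .len    = ULLONG_MAX,
                .minlen = 1024 * 1024,
        };
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
                perror("FITRIM");
                return 1;
        }

        /* On success the kernel reports back how many bytes it trimmed. */
        printf("%llu bytes trimmed\n", (unsigned long long)range.len);
        return 0;
}

The same three criteria are what fstrim(8) exposes as -o/--offset, -l/--length and -m/--minimum, so a run such as "fstrim -v -m 1M /mnt" exercises exactly the code path this commit reworks.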

File tree: 3 files changed (+188, -24 lines)


fs/xfs/xfs_discard.c

Lines changed: 156 additions & 18 deletions
@@ -19,6 +19,56 @@
 #include "xfs_log.h"
 #include "xfs_ag.h"
 
+/*
+ * Notes on an efficient, low latency fstrim algorithm
+ *
+ * We need to walk the filesystem free space and issue discards on the free
+ * space that meet the search criteria (size and location). We cannot issue
+ * discards on extents that might be in use, or are so recently in use they are
+ * still marked as busy. To serialise against extent state changes whilst we are
+ * gathering extents to trim, we must hold the AGF lock to lock out other
+ * allocations and extent free operations that might change extent state.
+ *
+ * However, we cannot just hold the AGF for the entire AG free space walk whilst
+ * we issue discards on each free space that is found. Storage devices can have
+ * extremely slow discard implementations (e.g. ceph RBD) and so walking a
+ * couple of million free extents and issuing synchronous discards on each
+ * extent can take a *long* time. Whilst we are doing this walk, nothing else
+ * can access the AGF, and we can stall transactions and hence the log whilst
+ * modifications wait for the AGF lock to be released. This can lead hung tasks
+ * kicking the hung task timer and rebooting the system. This is bad.
+ *
+ * Hence we need to take a leaf from the bulkstat playbook. It takes the AGI
+ * lock, gathers a range of inode cluster buffers that are allocated, drops the
+ * AGI lock and then reads all the inode cluster buffers and processes them. It
+ * loops doing this, using a cursor to keep track of where it is up to in the AG
+ * for each iteration to restart the INOBT lookup from.
+ *
+ * We can't do this exactly with free space - once we drop the AGF lock, the
+ * state of the free extent is out of our control and we cannot run a discard
+ * safely on it in this situation. Unless, of course, we've marked the free
+ * extent as busy and undergoing a discard operation whilst we held the AGF
+ * locked.
+ *
+ * This is exactly how online discard works - free extents are marked busy when
+ * they are freed, and once the extent free has been committed to the journal,
+ * the busy extent record is marked as "undergoing discard" and the discard is
+ * then issued on the free extent. Once the discard completes, the busy extent
+ * record is removed and the extent is able to be allocated again.
+ *
+ * In the context of fstrim, if we find a free extent we need to discard, we
+ * don't have to discard it immediately. All we need to do it record that free
+ * extent as being busy and under discard, and all the allocation routines will
+ * now avoid trying to allocate it. Hence if we mark the extent as busy under
+ * the AGF lock, we can safely discard it without holding the AGF lock because
+ * nothing will attempt to allocate that free space until the discard completes.
+ *
+ * This also allows us to issue discards asynchronously like we do with online
+ * discard, and so for fast devices fstrim will run much faster as we can have
+ * multiple discard operations in flight at once, as well as pipeline the free
+ * extent search so that it overlaps in flight discard IO.
+ */
+
 struct workqueue_struct *xfs_discard_wq;
 
 static void
@@ -94,21 +144,22 @@ xfs_discard_extents(
 }
 
 
-STATIC int
-xfs_trim_extents(
+static int
+xfs_trim_gather_extents(
        struct xfs_perag        *pag,
        xfs_daddr_t             start,
        xfs_daddr_t             end,
        xfs_daddr_t             minlen,
+       struct xfs_alloc_rec_incore *tcur,
+       struct xfs_busy_extents *extents,
        uint64_t                *blocks_trimmed)
 {
        struct xfs_mount        *mp = pag->pag_mount;
-       struct block_device     *bdev = mp->m_ddev_targp->bt_bdev;
        struct xfs_btree_cur    *cur;
        struct xfs_buf          *agbp;
-       struct xfs_agf          *agf;
        int                     error;
        int                     i;
+       int                     batch = 100;
 
        /*
         * Force out the log. This means any transactions that might have freed
@@ -120,20 +171,28 @@ xfs_trim_extents(
        error = xfs_alloc_read_agf(pag, NULL, 0, &agbp);
        if (error)
                return error;
-       agf = agbp->b_addr;
 
        cur = xfs_allocbt_init_cursor(mp, NULL, agbp, pag, XFS_BTNUM_CNT);
 
        /*
-        * Look up the longest btree in the AGF and start with it.
+        * Look up the extent length requested in the AGF and start with it.
         */
-       error = xfs_alloc_lookup_ge(cur, 0, be32_to_cpu(agf->agf_longest), &i);
+       if (tcur->ar_startblock == NULLAGBLOCK)
+               error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i);
+       else
+               error = xfs_alloc_lookup_le(cur, tcur->ar_startblock,
+                               tcur->ar_blockcount, &i);
        if (error)
                goto out_del_cursor;
+       if (i == 0) {
+               /* nothing of that length left in the AG, we are done */
+               tcur->ar_blockcount = 0;
+               goto out_del_cursor;
+       }
 
        /*
         * Loop until we are done with all extents that are large
-        * enough to be worth discarding.
+        * enough to be worth discarding or we hit batch limits.
         */
        while (i) {
                xfs_agblock_t   fbno;
@@ -148,7 +207,16 @@ xfs_trim_extents(
                        error = -EFSCORRUPTED;
                        break;
                }
-               ASSERT(flen <= be32_to_cpu(agf->agf_longest));
+
+               if (--batch <= 0) {
+                       /*
+                        * Update the cursor to point at this extent so we
+                        * restart the next batch from this extent.
+                        */
+                       tcur->ar_startblock = fbno;
+                       tcur->ar_blockcount = flen;
+                       break;
+               }
 
                /*
                 * use daddr format for all range/len calculations as that is
@@ -163,6 +231,7 @@ xfs_trim_extents(
                 */
                if (dlen < minlen) {
                        trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen);
+                       tcur->ar_blockcount = 0;
                        break;
                }
 
@@ -185,29 +254,98 @@ xfs_trim_extents(
                        goto next_extent;
                }
 
-               trace_xfs_discard_extent(mp, pag->pag_agno, fbno, flen);
-               error = blkdev_issue_discard(bdev, dbno, dlen, GFP_NOFS);
-               if (error)
-                       break;
+               xfs_extent_busy_insert_discard(pag, fbno, flen,
+                               &extents->extent_list);
                *blocks_trimmed += flen;
-
 next_extent:
                error = xfs_btree_decrement(cur, 0, &i);
                if (error)
                        break;
 
-               if (fatal_signal_pending(current)) {
-                       error = -ERESTARTSYS;
-                       break;
-               }
+               /*
+                * If there's no more records in the tree, we are done. Set the
+                * cursor block count to 0 to indicate to the caller that there
+                * is no more extents to search.
+                */
+               if (i == 0)
+                       tcur->ar_blockcount = 0;
        }
 
+       /*
+        * If there was an error, release all the gathered busy extents because
+        * we aren't going to issue a discard on them any more.
+        */
+       if (error)
+               xfs_extent_busy_clear(mp, &extents->extent_list, false);
 out_del_cursor:
        xfs_btree_del_cursor(cur, error);
        xfs_buf_relse(agbp);
        return error;
 }
 
+/*
+ * Iterate the free list gathering extents and discarding them. We need a cursor
+ * for the repeated iteration of gather/discard loop, so use the longest extent
+ * we found in the last batch as the key to start the next.
+ */
+static int
+xfs_trim_extents(
+       struct xfs_perag        *pag,
+       xfs_daddr_t             start,
+       xfs_daddr_t             end,
+       xfs_daddr_t             minlen,
+       uint64_t                *blocks_trimmed)
+{
+       struct xfs_alloc_rec_incore tcur = {
+               .ar_blockcount = pag->pagf_longest,
+               .ar_startblock = NULLAGBLOCK,
+       };
+       int                     error = 0;
+
+       do {
+               struct xfs_busy_extents *extents;
+
+               extents = kzalloc(sizeof(*extents), GFP_KERNEL);
+               if (!extents) {
+                       error = -ENOMEM;
+                       break;
+               }
+
+               extents->mount = pag->pag_mount;
+               extents->owner = extents;
+               INIT_LIST_HEAD(&extents->extent_list);
+
+               error = xfs_trim_gather_extents(pag, start, end, minlen,
+                               &tcur, extents, blocks_trimmed);
+               if (error) {
+                       kfree(extents);
+                       break;
+               }
+
+               /*
+                * We hand the extent list to the discard function here so the
+                * discarded extents can be removed from the busy extent list.
+                * This allows the discards to run asynchronously with gathering
+                * the next round of extents to discard.
+                *
+                * However, we must ensure that we do not reference the extent
+                * list after this function call, as it may have been freed by
+                * the time control returns to us.
+                */
+               error = xfs_discard_extents(pag->pag_mount, extents);
+               if (error)
+                       break;
+
+               if (fatal_signal_pending(current)) {
+                       error = -ERESTARTSYS;
+                       break;
+               }
+       } while (tcur.ar_blockcount != 0);
+
+       return error;
+
+}
+
 /*
  * trim a range of the filesystem.
  *

fs/xfs/xfs_extent_busy.c

Lines changed: 28 additions & 6 deletions
@@ -19,13 +19,13 @@
 #include "xfs_log.h"
 #include "xfs_ag.h"
 
-void
-xfs_extent_busy_insert(
-       struct xfs_trans        *tp,
+static void
+xfs_extent_busy_insert_list(
        struct xfs_perag        *pag,
        xfs_agblock_t           bno,
        xfs_extlen_t            len,
-       unsigned int            flags)
+       unsigned int            flags,
+       struct list_head        *busy_list)
 {
        struct xfs_extent_busy  *new;
        struct xfs_extent_busy  *busyp;
@@ -40,7 +40,7 @@ xfs_extent_busy_insert(
        new->flags = flags;
 
        /* trace before insert to be able to see failed inserts */
-       trace_xfs_extent_busy(tp->t_mountp, pag->pag_agno, bno, len);
+       trace_xfs_extent_busy(pag->pag_mount, pag->pag_agno, bno, len);
 
        spin_lock(&pag->pagb_lock);
        rbp = &pag->pagb_tree.rb_node;
@@ -62,10 +62,32 @@ xfs_extent_busy_insert(
        rb_link_node(&new->rb_node, parent, rbp);
        rb_insert_color(&new->rb_node, &pag->pagb_tree);
 
-       list_add(&new->list, &tp->t_busy);
+       list_add(&new->list, busy_list);
        spin_unlock(&pag->pagb_lock);
 }
 
+void
+xfs_extent_busy_insert(
+       struct xfs_trans        *tp,
+       struct xfs_perag        *pag,
+       xfs_agblock_t           bno,
+       xfs_extlen_t            len,
+       unsigned int            flags)
+{
+       xfs_extent_busy_insert_list(pag, bno, len, flags, &tp->t_busy);
+}
+
+void
+xfs_extent_busy_insert_discard(
+       struct xfs_perag        *pag,
+       xfs_agblock_t           bno,
+       xfs_extlen_t            len,
+       struct list_head        *busy_list)
+{
+       xfs_extent_busy_insert_list(pag, bno, len, XFS_EXTENT_BUSY_DISCARDED,
+                       busy_list);
+}
+
 /*
  * Search for a busy extent within the range of the extent we are about to
  * allocate. You need to be holding the busy extent tree lock when calling

fs/xfs/xfs_extent_busy.h

Lines changed: 4 additions & 0 deletions
@@ -49,6 +49,10 @@ void
 xfs_extent_busy_insert(struct xfs_trans *tp, struct xfs_perag *pag,
        xfs_agblock_t bno, xfs_extlen_t len, unsigned int flags);
 
+void
+xfs_extent_busy_insert_discard(struct xfs_perag *pag, xfs_agblock_t bno,
+       xfs_extlen_t len, struct list_head *busy_list);
+
 void
 xfs_extent_busy_clear(struct xfs_mount *mp, struct list_head *list,
        bool do_discard);
