Skip to content

Commit 3c919b0

Browse files
author
Darrick J. Wong
committed
xfs: reserve less log space when recovering log intent items
Wengang Wang reports that a customer's system was running a number of truncate operations on a filesystem with a very small log. Contention on the reserve heads lead to other threads stalling on smaller updates (e.g. mtime updates) long enough to result in the node being rebooted on account of the lack of responsivenes. The node failed to recover because log recovery of an EFI became stuck waiting for a grant of reserve space. From Wengang's report: "For the file deletion, log bytes are reserved basing on xfs_mount->tr_itruncate which is: tr_logres = 175488, tr_logcount = 2, tr_logflags = XFS_TRANS_PERM_LOG_RES, "You see it's a permanent log reservation with two log operations (two transactions in rolling mode). After calculation (xlog_calc_unit_res() adds space for various log headers), the final log space needed per transaction changes from 175488 to 180208 bytes. So the total log space needed is 360416 bytes (180208 * 2). [That quantity] of log space (360416 bytes) needs to be reserved for both run time inode removing (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover())." In other words, runtime pre-reserves 360K of space in anticipation of running a chain of two transactions in which each transaction gets a 180K reservation. Now that we've allocated the transaction, we delete the bmap mapping, log an EFI to free the space, and roll the transaction as part of finishing the deferops chain. Rolling creates a new xfs_trans which shares its ticket with the old transaction. Next, xfs_trans_roll calls __xfs_trans_commit with regrant == true, which calls xlog_cil_commit with the same regrant parameter. xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and subtracts t_curr_res from the reservation and write heads. If the filesystem is fresh and the first transaction only used (say) 20K, then t_curr_res will be 160K, and we give that much reservation back to the reservation head. Or if the file is really fragmented and the first transaction actually uses 170K, then t_curr_res will be 10K, and that's what we give back to the reservation. Having done that, we're now headed into the second transaction with an EFI and 180K of reservation. Other threads apparently consumed all the reservation for smaller transactions, such as timestamp updates. Now let's say the first transaction gets written to disk and we crash without ever completing the second transaction. Now we remount the fs, log recovery finds the unfinished EFI, and calls xfs_efi_recover to finish the EFI. However, xfs_efi_recover starts a new tr_itruncate tranasction, which asks for 360K log reservation. This is a lot more than the 180K that we had reserved at the time of the crash. If the first EFI to be recovered is also pinning the tail of the log, we will be unable to free any space in the log, and recovery livelocks. Wengang confirmed this: "Now we have the second transaction which has 180208 log bytes reserved too. The second transaction is supposed to process intents including extent freeing. With my hacking patch, I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. "With my test case, other transactions (update timestamps) then happen. As my hacking patch pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available log space (in memory in on disk). And finally the on disk (and in memory) available log space goes down near to 180208 bytes. Those 180208 bytes are reserved by [the] second (extent-free) transaction [in the chain]." Wengang and I noticed that EFI recovery starts a transaction, completes one step of the chain, and commits the transaction without completing any other steps of the chain. Those subsequent steps are completed by xlog_finish_defer_ops, which allocates yet another transaction to finish the rest of the chain. That transaction gets the same tr_logres as the head transaction, but with tr_logcount = 1 to force regranting with every roll to avoid livelocks. In other words, we already figured this out in commit 929b92f ("xfs: xfs_defer_capture should absorb remaining transaction reservation"), but should have applied that logic to each intent item's recovery function. For Wengang's case, the xfs_trans_alloc call in the EFI recovery function should only be asking for a single transaction's worth of log reservation -- 180K, not 360K. Quoting Wengang again: "With log recovery, during EFI recovery, we use tr_itruncate again to reserve two transactions that needs 360416 log bytes. Reserving 360416 bytes fails [stalls] because we now only have about 180208 available. "Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have (a bit) more than 180208 available log bytes on disk, so [if we decrease the reservation to 180K] the reservation goes and the recovery [finishes]. That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res xfs_mount->tr_ext_free { tr_logres = 175488, tr_logcount = 0, tr_logflags = 0, } "and use tr_ext_free instead of tr_itruncate in EFI recover." However, I don't think it quite makes sense to create an entirely new transaction reservation type to handle single-stepping during log recovery. Instead, we should copy the transaction reservation information in the xfs_mount, change tr_logcount to 1, and pass that into xfs_trans_alloc. We know this won't risk changing the min log size computation since we always ask for a fraction of the reservation for all known transaction types. This looks like it's been lurking in the codebase since commit 3d3c8b5, which changed the xfs_trans_reserve call in xlog_recover_process_efi to use the tr_logcount in tr_itruncate. That changed the EFI recovery transaction from making a non-XFS_TRANS_PERM_LOG_RES request for one transaction's worth of log space to a XFS_TRANS_PERM_LOG_RES request for two transactions worth. Fixes: 3d3c8b5 ("xfs: refactor xfs_trans_reserve() interface") Complements: 929b92f ("xfs: xfs_defer_capture should absorb remaining transaction reservation") Suggested-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Srikanth C S <srikanth.c.s@oracle.com> [djwong: apply the same transformation to all log intent recovery] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
1 parent 74ad469 commit 3c919b0

File tree

6 files changed

+40
-9
lines changed

6 files changed

+40
-9
lines changed

fs/xfs/libxfs/xfs_log_recover.h

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -131,4 +131,26 @@ void xlog_check_buf_cancel_table(struct xlog *log);
131131
#define xlog_check_buf_cancel_table(log) do { } while (0)
132132
#endif
133133

134+
/*
135+
* Transform a regular reservation into one suitable for recovery of a log
136+
* intent item.
137+
*
138+
* Intent recovery only runs a single step of the transaction chain and defers
139+
* the rest to a separate transaction. Therefore, we reduce logcount to 1 here
140+
* to avoid livelocks if the log grant space is nearly exhausted due to the
141+
* recovered intent pinning the tail. Keep the same logflags to avoid tripping
142+
* asserts elsewhere. Struct copies abound below.
143+
*/
144+
static inline struct xfs_trans_res
145+
xlog_recover_resv(const struct xfs_trans_res *r)
146+
{
147+
struct xfs_trans_res ret = {
148+
.tr_logres = r->tr_logres,
149+
.tr_logcount = 1,
150+
.tr_logflags = r->tr_logflags,
151+
};
152+
153+
return ret;
154+
}
155+
134156
#endif /* __XFS_LOG_RECOVER_H__ */

fs/xfs/xfs_attr_item.c

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -547,7 +547,7 @@ xfs_attri_item_recover(
547547
struct xfs_inode *ip;
548548
struct xfs_da_args *args;
549549
struct xfs_trans *tp;
550-
struct xfs_trans_res tres;
550+
struct xfs_trans_res resv;
551551
struct xfs_attri_log_format *attrp;
552552
struct xfs_attri_log_nameval *nv = attrip->attri_nameval;
553553
int error;
@@ -618,8 +618,9 @@ xfs_attri_item_recover(
618618
goto out;
619619
}
620620

621-
xfs_init_attr_trans(args, &tres, &total);
622-
error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
621+
xfs_init_attr_trans(args, &resv, &total);
622+
resv = xlog_recover_resv(&resv);
623+
error = xfs_trans_alloc(mp, &resv, total, 0, XFS_TRANS_RESERVE, &tp);
623624
if (error)
624625
goto out;
625626

fs/xfs/xfs_bmap_item.c

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -490,6 +490,7 @@ xfs_bui_item_recover(
490490
struct list_head *capture_list)
491491
{
492492
struct xfs_bmap_intent fake = { };
493+
struct xfs_trans_res resv;
493494
struct xfs_bui_log_item *buip = BUI_ITEM(lip);
494495
struct xfs_trans *tp;
495496
struct xfs_inode *ip = NULL;
@@ -515,7 +516,8 @@ xfs_bui_item_recover(
515516
return error;
516517

517518
/* Allocate transaction and do the work. */
518-
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
519+
resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
520+
error = xfs_trans_alloc(mp, &resv,
519521
XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
520522
if (error)
521523
goto err_rele;

fs/xfs/xfs_extfree_item.c

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -660,6 +660,7 @@ xfs_efi_item_recover(
660660
struct xfs_log_item *lip,
661661
struct list_head *capture_list)
662662
{
663+
struct xfs_trans_res resv;
663664
struct xfs_efi_log_item *efip = EFI_ITEM(lip);
664665
struct xfs_mount *mp = lip->li_log->l_mp;
665666
struct xfs_efd_log_item *efdp;
@@ -683,7 +684,8 @@ xfs_efi_item_recover(
683684
}
684685
}
685686

686-
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
687+
resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
688+
error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
687689
if (error)
688690
return error;
689691
efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);

fs/xfs/xfs_refcount_item.c

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -477,6 +477,7 @@ xfs_cui_item_recover(
477477
struct xfs_log_item *lip,
478478
struct list_head *capture_list)
479479
{
480+
struct xfs_trans_res resv;
480481
struct xfs_cui_log_item *cuip = CUI_ITEM(lip);
481482
struct xfs_cud_log_item *cudp;
482483
struct xfs_trans *tp;
@@ -514,8 +515,9 @@ xfs_cui_item_recover(
514515
* doesn't fit. We need to reserve enough blocks to handle a
515516
* full btree split on either end of the refcount range.
516517
*/
517-
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
518-
mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
518+
resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
519+
error = xfs_trans_alloc(mp, &resv, mp->m_refc_maxlevels * 2, 0,
520+
XFS_TRANS_RESERVE, &tp);
519521
if (error)
520522
return error;
521523

fs/xfs/xfs_rmap_item.c

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -507,6 +507,7 @@ xfs_rui_item_recover(
507507
struct xfs_log_item *lip,
508508
struct list_head *capture_list)
509509
{
510+
struct xfs_trans_res resv;
510511
struct xfs_rui_log_item *ruip = RUI_ITEM(lip);
511512
struct xfs_rud_log_item *rudp;
512513
struct xfs_trans *tp;
@@ -530,8 +531,9 @@ xfs_rui_item_recover(
530531
}
531532
}
532533

533-
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
534-
mp->m_rmap_maxlevels, 0, XFS_TRANS_RESERVE, &tp);
534+
resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
535+
error = xfs_trans_alloc(mp, &resv, mp->m_rmap_maxlevels, 0,
536+
XFS_TRANS_RESERVE, &tp);
535537
if (error)
536538
return error;
537539
rudp = xfs_trans_get_rud(tp, ruip);

0 commit comments

Comments
 (0)