Skip to content

Commit 369c001

Browse files
author
Darrick J. Wong
committed
xfs: rewrite xchk_inode_is_allocated to work properly
Back in the mists of time[1], I proposed this function to assist the inode btree scrubbers in checking the inode btree contents against the allocation state of the inode records. The original version performed a direct lookup in the inode cache and returned the allocation status if the cached inode hadn't been reused and wasn't in an intermediate state. Brian thought it would be better to use the usual iget/irele mechanisms, so that was changed for the final version. Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has one user and clutters up the regular iget path, which makes it hard to reason about how it actually works. Worse yet, the inode inactivation series silently broke it because iget won't return inodes that are anywhere in the inactivation machinery, even though the caller is already required to prevent inode allocation and freeing. Inodes in the inactivation machinery are still allocated, but the current code's interactions with the iget code prevent us from being able to say that. Now that I understand the inode lifecycle better than I did in early 2017, I now realize that as long as the cached inode hasn't been reused and isn't actively being reclaimed, it's safe to access the i_mode field (with the AGI, rcu, and i_flags locks held), and we don't need to worry about the inode being freed out from under us. Therefore, port the original version to modern code structure, which fixes the brokennes w.r.t. inactivation. In the next patch we'll remove IGET_INCORE since it's no longer necessary. [1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
1 parent 0d29663 commit 369c001

File tree

4 files changed

+162
-24
lines changed

4 files changed

+162
-24
lines changed

fs/xfs/scrub/common.c

Lines changed: 137 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1232,38 +1232,153 @@ xchk_fsgates_enable(
12321232
}
12331233

12341234
/*
1235-
* Decide if this is this a cached inode that's also allocated.
1235+
* Decide if this is this a cached inode that's also allocated. The caller
1236+
* must hold a reference to an AG and the AGI buffer lock to prevent inodes
1237+
* from being allocated or freed.
12361238
*
1237-
* Look up an inode by number in the given file system. If the inode is
1238-
* in cache and isn't in purgatory, return 1 if the inode is allocated
1239-
* and 0 if it is not. For all other cases (not in cache, being torn
1240-
* down, etc.), return a negative error code.
1239+
* Look up an inode by number in the given file system. If the inode number
1240+
* is invalid, return -EINVAL. If the inode is not in cache, return -ENODATA.
1241+
* If the inode is being reclaimed, return -ENODATA because we know the inode
1242+
* cache cannot be updating the ondisk metadata.
12411243
*
1242-
* The caller has to prevent inode allocation and freeing activity,
1243-
* presumably by locking the AGI buffer. This is to ensure that an
1244-
* inode cannot transition from allocated to freed until the caller is
1245-
* ready to allow that. If the inode is in an intermediate state (new,
1246-
* reclaimable, or being reclaimed), -EAGAIN will be returned; if the
1247-
* inode is not in the cache, -ENOENT will be returned. The caller must
1248-
* deal with these scenarios appropriately.
1249-
*
1250-
* This is a specialized use case for the online scrubber; if you're
1251-
* reading this, you probably want xfs_iget.
1244+
* Otherwise, the incore inode is the one we want, and it is either live,
1245+
* somewhere in the inactivation machinery, or reclaimable. The inode is
1246+
* allocated if i_mode is nonzero. In all three cases, the cached inode will
1247+
* be more up to date than the ondisk inode buffer, so we must use the incore
1248+
* i_mode.
12521249
*/
12531250
int
12541251
xchk_inode_is_allocated(
12551252
struct xfs_scrub *sc,
1256-
xfs_ino_t ino,
1253+
xfs_agino_t agino,
12571254
bool *inuse)
12581255
{
1256+
struct xfs_mount *mp = sc->mp;
1257+
struct xfs_perag *pag = sc->sa.pag;
1258+
xfs_ino_t ino;
12591259
struct xfs_inode *ip;
12601260
int error;
12611261

1262-
error = xfs_iget(sc->mp, sc->tp, ino, XFS_IGET_INCORE, 0, &ip);
1263-
if (error)
1264-
return error;
1262+
/* caller must hold perag reference */
1263+
if (pag == NULL) {
1264+
ASSERT(pag != NULL);
1265+
return -EINVAL;
1266+
}
12651267

1266-
*inuse = !!(VFS_I(ip)->i_mode);
1267-
xfs_irele(ip);
1268-
return 0;
1268+
/* caller must have AGI buffer */
1269+
if (sc->sa.agi_bp == NULL) {
1270+
ASSERT(sc->sa.agi_bp != NULL);
1271+
return -EINVAL;
1272+
}
1273+
1274+
/* reject inode numbers outside existing AGs */
1275+
ino = XFS_AGINO_TO_INO(sc->mp, pag->pag_agno, agino);
1276+
if (!xfs_verify_ino(mp, ino))
1277+
return -EINVAL;
1278+
1279+
error = -ENODATA;
1280+
rcu_read_lock();
1281+
ip = radix_tree_lookup(&pag->pag_ici_root, agino);
1282+
if (!ip) {
1283+
/* cache miss */
1284+
goto out_rcu;
1285+
}
1286+
1287+
/*
1288+
* If the inode number doesn't match, the incore inode got reused
1289+
* during an RCU grace period and the radix tree hasn't been updated.
1290+
* This isn't the inode we want.
1291+
*/
1292+
spin_lock(&ip->i_flags_lock);
1293+
if (ip->i_ino != ino)
1294+
goto out_skip;
1295+
1296+
trace_xchk_inode_is_allocated(ip);
1297+
1298+
/*
1299+
* We have an incore inode that matches the inode we want, and the
1300+
* caller holds the perag structure and the AGI buffer. Let's check
1301+
* our assumptions below:
1302+
*/
1303+
1304+
#ifdef DEBUG
1305+
/*
1306+
* (1) If the incore inode is live (i.e. referenced from the dcache),
1307+
* it will not be INEW, nor will it be in the inactivation or reclaim
1308+
* machinery. The ondisk inode had better be allocated. This is the
1309+
* most trivial case.
1310+
*/
1311+
if (!(ip->i_flags & (XFS_NEED_INACTIVE | XFS_INEW | XFS_IRECLAIMABLE |
1312+
XFS_INACTIVATING))) {
1313+
/* live inode */
1314+
ASSERT(VFS_I(ip)->i_mode != 0);
1315+
}
1316+
1317+
/*
1318+
* If the incore inode is INEW, there are several possibilities:
1319+
*
1320+
* (2) For a file that is being created, note that we allocate the
1321+
* ondisk inode before allocating, initializing, and adding the incore
1322+
* inode to the radix tree.
1323+
*
1324+
* (3) If the incore inode is being recycled, the inode has to be
1325+
* allocated because we don't allow freed inodes to be recycled.
1326+
* Recycling doesn't touch i_mode.
1327+
*/
1328+
if (ip->i_flags & XFS_INEW) {
1329+
/* created on disk already or recycling */
1330+
ASSERT(VFS_I(ip)->i_mode != 0);
1331+
}
1332+
1333+
/*
1334+
* (4) If the inode is queued for inactivation (NEED_INACTIVE) but
1335+
* inactivation has not started (!INACTIVATING), it is still allocated.
1336+
*/
1337+
if ((ip->i_flags & XFS_NEED_INACTIVE) &&
1338+
!(ip->i_flags & XFS_INACTIVATING)) {
1339+
/* definitely before difree */
1340+
ASSERT(VFS_I(ip)->i_mode != 0);
1341+
}
1342+
#endif
1343+
1344+
/*
1345+
* If the incore inode is undergoing inactivation (INACTIVATING), there
1346+
* are two possibilities:
1347+
*
1348+
* (5) It is before the point where it would get freed ondisk, in which
1349+
* case i_mode is still nonzero.
1350+
*
1351+
* (6) It has already been freed, in which case i_mode is zero.
1352+
*
1353+
* We don't take the ILOCK here, but difree and dialloc update the AGI,
1354+
* and we've taken the AGI buffer lock, which prevents that from
1355+
* happening.
1356+
*/
1357+
1358+
/*
1359+
* (7) Inodes undergoing inactivation (INACTIVATING) or queued for
1360+
* reclaim (IRECLAIMABLE) could be allocated or free. i_mode still
1361+
* reflects the ondisk state.
1362+
*/
1363+
1364+
/*
1365+
* (8) If the inode is in IFLUSHING, it's safe to query i_mode because
1366+
* the flush code uses i_mode to format the ondisk inode.
1367+
*/
1368+
1369+
/*
1370+
* (9) If the inode is in IRECLAIM and was reachable via the radix
1371+
* tree, it still has the same i_mode as it did before it entered
1372+
* reclaim. The inode object is still alive because we hold the RCU
1373+
* read lock.
1374+
*/
1375+
1376+
*inuse = VFS_I(ip)->i_mode != 0;
1377+
error = 0;
1378+
1379+
out_skip:
1380+
spin_unlock(&ip->i_flags_lock);
1381+
out_rcu:
1382+
rcu_read_unlock();
1383+
return error;
12691384
}

fs/xfs/scrub/common.h

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -203,6 +203,7 @@ static inline bool xchk_need_intent_drain(struct xfs_scrub *sc)
203203

204204
void xchk_fsgates_enable(struct xfs_scrub *sc, unsigned int scrub_fshooks);
205205

206-
int xchk_inode_is_allocated(struct xfs_scrub *sc, xfs_ino_t ino, bool *inuse);
206+
int xchk_inode_is_allocated(struct xfs_scrub *sc, xfs_agino_t agino,
207+
bool *inuse);
207208

208209
#endif /* __XFS_SCRUB_COMMON_H__ */

fs/xfs/scrub/ialloc.c

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -328,7 +328,7 @@ xchk_iallocbt_check_cluster_ifree(
328328
goto out;
329329
}
330330

331-
error = xchk_inode_is_allocated(bs->sc, fsino, &ino_inuse);
331+
error = xchk_inode_is_allocated(bs->sc, agino, &ino_inuse);
332332
if (error == -ENODATA) {
333333
/* Not cached, just read the disk buffer */
334334
freemask_ok = irec_free ^ !!(dip->di_mode);

fs/xfs/scrub/trace.h

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -640,6 +640,28 @@ TRACE_EVENT(xchk_iallocbt_check_cluster,
640640
__entry->cluster_ino)
641641
)
642642

643+
TRACE_EVENT(xchk_inode_is_allocated,
644+
TP_PROTO(struct xfs_inode *ip),
645+
TP_ARGS(ip),
646+
TP_STRUCT__entry(
647+
__field(dev_t, dev)
648+
__field(xfs_ino_t, ino)
649+
__field(unsigned long, iflags)
650+
__field(umode_t, mode)
651+
),
652+
TP_fast_assign(
653+
__entry->dev = VFS_I(ip)->i_sb->s_dev;
654+
__entry->ino = ip->i_ino;
655+
__entry->iflags = ip->i_flags;
656+
__entry->mode = VFS_I(ip)->i_mode;
657+
),
658+
TP_printk("dev %d:%d ino 0x%llx iflags 0x%lx mode 0x%x",
659+
MAJOR(__entry->dev), MINOR(__entry->dev),
660+
__entry->ino,
661+
__entry->iflags,
662+
__entry->mode)
663+
);
664+
643665
TRACE_EVENT(xchk_fscounters_calc,
644666
TP_PROTO(struct xfs_mount *mp, uint64_t icount, uint64_t ifree,
645667
uint64_t fdblocks, uint64_t delalloc),

0 commit comments

Comments
 (0)