
Commit 9d3de7e

OjaswinM authored and tytso committed
ext4: fix rbtree traversal bug in ext4_mb_use_preallocated
During allocations, while looking for preallocations (PA) in the per-inode
rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can mark a pa deleted in parallel and
that can cause direct traversal to skip some entries. This was leading to a
BUG_ON() being hit [1] when we missed a PA that could satisfy our request and
ultimately tried to create a new PA that would overlap with the missed one.

To make sure we handle that case while still keeping the performance of the
rbtree, we make use of the fact that the only pa that could possibly overlap
the original goal start is the one that satisfies both of the conditions
below:

1. It must have its logical start immediately to the left of (i.e. less than)
   the original logical start.
2. It must not be deleted.

To find this pa we use the following traversal method:

1. Descend into the rbtree normally to find the immediately neighboring PA.
   Here we keep descending irrespective of whether the PA is deleted or
   whether it overlaps with our request, etc. The goal is simply to find an
   immediately adjacent PA.
2. If the found PA is to the right of the original goal, use rb_prev() to
   find the left adjacent PA.
3. Check if this PA is deleted and keep moving left with rb_prev() until a
   non-deleted PA is found.
4. This is the PA we are looking for. Now we can check whether it can satisfy
   the original request and proceed accordingly. This approach also takes
   care of deleted PAs being present in the tree.

(While we are at it, also fix a possible overflow bug in calculating the end
of a PA.)

[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/

Cc: stable@kernel.org # 6.4
Fixes: 3872778 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
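Editor's note: the following is a minimal, single-threaded userspace sketch of the four-step traversal described in the commit message. The struct pa_node type and the pa_prev()/find_candidate() helpers are illustrative stand-ins (assumptions), not the kernel's rb_node / ext4_prealloc_space API, and the pa_lock / i_prealloc_lock locking the real code needs is omitted.

/*
 * Sketch of the "find the only PA that can overlap the goal" traversal.
 * All names here are hypothetical; they model, not reproduce, the kernel code.
 */
#include <stdbool.h>
#include <stddef.h>

struct pa_node {
	unsigned long lstart;           /* logical start of the PA (immutable) */
	unsigned long len;              /* length of the PA in blocks          */
	bool deleted;                   /* set asynchronously by group discard */
	struct pa_node *left, *right, *parent;
};

/* in-order predecessor: the userspace analogue of rb_prev() */
struct pa_node *pa_prev(struct pa_node *n)
{
	if (n->left) {
		n = n->left;
		while (n->right)
			n = n->right;
		return n;
	}
	while (n->parent && n == n->parent->left)
		n = n->parent;
	return n->parent;
}

/*
 * Return the only PA that can possibly contain 'goal': the closest
 * non-deleted PA whose lstart is <= goal, provided it actually overlaps.
 */
struct pa_node *find_candidate(struct pa_node *root, unsigned long goal)
{
	struct pa_node *cur = root, *pa = NULL;

	/* Step 1: descend normally to the PA immediately adjacent to goal */
	while (cur) {
		pa = cur;
		cur = (goal < cur->lstart) ? cur->left : cur->right;
	}
	if (!pa)
		return NULL;

	/* Step 2: if the adjacent PA is to the right of goal, step left once */
	if (pa->lstart > goal)
		pa = pa_prev(pa);

	/* Step 3: skip over PAs already marked deleted by group discard */
	while (pa && pa->deleted)
		pa = pa_prev(pa);

	/* Step 4: only this PA can overlap goal; check that it really does */
	if (pa && goal < pa->lstart + pa->len)
		return pa;
	return NULL;
}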
1 parent 5d5460f commit 9d3de7e

File tree

1 file changed

+131 -27 lines changed

fs/ext4/mballoc.c

Lines changed: 131 additions & 27 deletions
@@ -4765,56 +4765,160 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	int order, i;
 	struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
 	struct ext4_locality_group *lg;
-	struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
-	ext4_lblk_t tmp_pa_start, tmp_pa_end;
+	struct ext4_prealloc_space *tmp_pa = NULL, *cpa = NULL;
+	loff_t tmp_pa_end;
 	struct rb_node *iter;
 	ext4_fsblk_t goal_block;
 
 	/* only data can be preallocated */
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
-	/* first, try per-file preallocation */
+	/*
+	 * first, try per-file preallocation by searching the inode pa rbtree.
+	 *
+	 * Here, we can't do a direct traversal of the tree because
+	 * ext4_mb_discard_group_preallocation() can paralelly mark the pa
+	 * deleted and that can cause direct traversal to skip some entries.
+	 */
 	read_lock(&ei->i_prealloc_lock);
+
+	if (RB_EMPTY_ROOT(&ei->i_prealloc_node)) {
+		goto try_group_pa;
+	}
+
+	/*
+	 * Step 1: Find a pa with logical start immediately adjacent to the
+	 * original logical start. This could be on the left or right.
+	 *
+	 * (tmp_pa->pa_lstart never changes so we can skip locking for it).
+	 */
 	for (iter = ei->i_prealloc_node.rb_node; iter;
 	     iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
-					    tmp_pa_start, iter)) {
+					    tmp_pa->pa_lstart, iter)) {
 		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
 				  pa_node.inode_node);
+	}
 
-		/* all fields in this condition don't change,
-		 * so we can skip locking for them */
-		tmp_pa_start = tmp_pa->pa_lstart;
-		tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
-
-		/* original request start doesn't lie in this PA */
-		if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
-		    ac->ac_o_ex.fe_logical >= tmp_pa_end)
-			continue;
+	/*
+	 * Step 2: The adjacent pa might be to the right of logical start, find
+	 * the left adjacent pa. After this step we'd have a valid tmp_pa whose
+	 * logical start is towards the left of original request's logical start
+	 */
+	if (tmp_pa->pa_lstart > ac->ac_o_ex.fe_logical) {
+		struct rb_node *tmp;
+		tmp = rb_prev(&tmp_pa->pa_node.inode_node);
 
-		/* non-extent files can't have physical blocks past 2^32 */
-		if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
-		    (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
-		     EXT4_MAX_BLOCK_FILE_PHYS)) {
+		if (tmp) {
+			tmp_pa = rb_entry(tmp, struct ext4_prealloc_space,
+					  pa_node.inode_node);
+		} else {
 			/*
-			 * Since PAs don't overlap, we won't find any
-			 * other PA to satisfy this.
+			 * If there is no adjacent pa to the left then finding
+			 * an overlapping pa is not possible hence stop searching
+			 * inode pa tree
 			 */
-			break;
+			goto try_group_pa;
 		}
+	}
 
-		/* found preallocated blocks, use them */
+	BUG_ON(!(tmp_pa && tmp_pa->pa_lstart <= ac->ac_o_ex.fe_logical));
+
+	/*
+	 * Step 3: If the left adjacent pa is deleted, keep moving left to find
+	 * the first non deleted adjacent pa. After this step we should have a
+	 * valid tmp_pa which is guaranteed to be non deleted.
+	 */
+	for (iter = &tmp_pa->pa_node.inode_node;; iter = rb_prev(iter)) {
+		if (!iter) {
+			/*
+			 * no non deleted left adjacent pa, so stop searching
+			 * inode pa tree
+			 */
+			goto try_group_pa;
+		}
+		tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+				  pa_node.inode_node);
 		spin_lock(&tmp_pa->pa_lock);
-		if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free &&
-		    likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
-			atomic_inc(&tmp_pa->pa_count);
-			ext4_mb_use_inode_pa(ac, tmp_pa);
+		if (tmp_pa->pa_deleted == 0) {
+			/*
+			 * We will keep holding the pa_lock from
+			 * this point on because we don't want group discard
+			 * to delete this pa underneath us. Since group
+			 * discard is anyways an ENOSPC operation it
+			 * should be okay for it to wait a few more cycles.
+			 */
+			break;
+		} else {
 			spin_unlock(&tmp_pa->pa_lock);
-			read_unlock(&ei->i_prealloc_lock);
-			return true;
 		}
+	}
+
+	BUG_ON(!(tmp_pa && tmp_pa->pa_lstart <= ac->ac_o_ex.fe_logical));
+	BUG_ON(tmp_pa->pa_deleted == 1);
+
+	/*
+	 * Step 4: We now have the non deleted left adjacent pa. Only this
+	 * pa can possibly satisfy the request hence check if it overlaps
+	 * original logical start and stop searching if it doesn't.
+	 */
+	tmp_pa_end = (loff_t)tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+	if (ac->ac_o_ex.fe_logical >= tmp_pa_end) {
+		spin_unlock(&tmp_pa->pa_lock);
+		goto try_group_pa;
+	}
+
+	/* non-extent files can't have physical blocks past 2^32 */
+	if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
+	    (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
+	     EXT4_MAX_BLOCK_FILE_PHYS)) {
+		/*
+		 * Since PAs don't overlap, we won't find any other PA to
+		 * satisfy this.
+		 */
 		spin_unlock(&tmp_pa->pa_lock);
+		goto try_group_pa;
+	}
+
+	if (tmp_pa->pa_free && likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
+		atomic_inc(&tmp_pa->pa_count);
+		ext4_mb_use_inode_pa(ac, tmp_pa);
+		spin_unlock(&tmp_pa->pa_lock);
+		read_unlock(&ei->i_prealloc_lock);
+		return true;
+	} else {
+		/*
+		 * We found a valid overlapping pa but couldn't use it because
+		 * it had no free blocks. This should ideally never happen
+		 * because:
+		 *
+		 * 1. When a new inode pa is added to rbtree it must have
+		 *    pa_free > 0 since otherwise we won't actually need
+		 *    preallocation.
+		 *
+		 * 2. An inode pa that is in the rbtree can only have it's
+		 *    pa_free become zero when another thread calls:
+		 *      ext4_mb_new_blocks
+		 *       ext4_mb_use_preallocated
+		 *        ext4_mb_use_inode_pa
+		 *
+		 * 3. Further, after the above calls make pa_free == 0, we will
+		 *    immediately remove it from the rbtree in:
+		 *      ext4_mb_new_blocks
+		 *       ext4_mb_release_context
+		 *        ext4_mb_put_pa
+		 *
+		 * 4. Since the pa_free becoming 0 and pa_free getting removed
+		 *    from tree both happen in ext4_mb_new_blocks, which is always
+		 *    called with i_data_sem held for data allocations, we can be
+		 *    sure that another process will never see a pa in rbtree with
+		 *    pa_free == 0.
+		 */
+		WARN_ON_ONCE(tmp_pa->pa_free == 0);
 	}
+	spin_unlock(&tmp_pa->pa_lock);
+try_group_pa:
 	read_unlock(&ei->i_prealloc_lock);
 
 	/* can we use group allocation? */
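
Editor's note on the overflow fix mentioned in the commit message: tmp_pa_end used to be computed in ext4_lblk_t, a 32-bit type, so pa_lstart + EXT4_C2B(sbi, pa_len) could wrap and make the PA appear to end before it starts; widening to loff_t before the addition, as the patch does, avoids that. Below is a standalone illustration with made-up values (the 0xfffff000 / 0x2000 numbers are hypothetical, not taken from the commit):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t pa_lstart = 0xfffff000u;  /* hypothetical PA start near 2^32   */
	uint32_t pa_blocks = 0x2000u;      /* hypothetical PA length in blocks  */

	/* 32-bit arithmetic wraps: the computed end is smaller than the start */
	uint32_t wrapped_end = pa_lstart + pa_blocks;

	/* widening one operand first (as the patch does with loff_t) keeps the sum exact */
	int64_t exact_end = (int64_t)pa_lstart + pa_blocks;

	printf("wrapped end: 0x%x\n", wrapped_end);                       /* 0x1000      */
	printf("exact end:   0x%llx\n", (unsigned long long)exact_end);  /* 0x100001000 */
	return 0;
}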
