
Commit fea4e31

hansendc authored and torvalds committed
x86/mm: Eliminate window where TLB flushes may be inadvertently skipped
tl;dr: There is a window in the mm switching code where the new CR3 is set and the CPU should be getting TLB flushes for the new mm. But should_flush_tlb() has a bug and suppresses the flush. Fix it by widening the window where should_flush_tlb() sends an IPI.

Long Version:

=== History ===

There were a few things leading up to this.

First, updating mm_cpumask() was observed to be too expensive, so it was made lazier. But being lazy caused too many unnecessary IPIs to CPUs due to the now-lazy mm_cpumask(). So code was added to cull mm_cpumask() periodically[2]. But that culling was a bit too aggressive and skipped sending TLB flushes to CPUs that need them. So here we are again.

=== Problem ===

The too-aggressive code in should_flush_tlb() strikes in this window:

        // Turn on IPIs for this CPU/mm combination, but only
        // if should_flush_tlb() agrees:
        cpumask_set_cpu(cpu, mm_cpumask(next));

        next_tlb_gen = atomic64_read(&next->context.tlb_gen);
        choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
        load_new_mm_cr3(need_flush);
        // ^ After 'need_flush' is set to false, IPIs *MUST*
        // be sent to this CPU and not be ignored.

        this_cpu_write(cpu_tlbstate.loaded_mm, next);
        // ^ Not until this point does should_flush_tlb()
        // become true!

should_flush_tlb() will suppress TLB flushes between load_new_mm_cr3() and writing to 'loaded_mm', which is a window where they should not be suppressed. Whoops.

=== Solution ===

Thankfully, the fuzzy "just about to write CR3" window is already marked with loaded_mm==LOADED_MM_SWITCHING. Simply checking for that state in should_flush_tlb() is sufficient to ensure that the CPU is targeted with an IPI.

This will cause more TLB flush IPIs. But the window is relatively small and I do not expect this to cause any kind of measurable performance impact.

Update the comment where LOADED_MM_SWITCHING is written since it grew yet another user.

Peter Z also raised a concern that should_flush_tlb() might not observe 'loaded_mm' and 'is_lazy' in the same order that switch_mm_irqs_off() writes them. Add a barrier to ensure that they are observed in the order they are written.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/oe-lkp/202411282207.6bd28eae-lkp@intel.com/ [1]
Fixes: 6db2526 ("x86/mm/tlb: Only trim the mm_cpumask once a second") [2]
Reported-by: Stephen Dolan <sdolan@janestreet.com>
Cc: stable@vger.kernel.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
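To make the ordering concrete, here is a small, self-contained user-space model of the reader/writer interaction described in the message above. It is only a sketch under simplified assumptions: the names model_switch_mm(), model_should_flush_tlb() and the (struct mm *)1 sentinel are invented to echo the kernel identifiers, C11 acquire/release atomics stand in for the kernel's barrier()/smp_rmb(), and per-CPU state, ASIDs and real IPIs are omitted. It is not the kernel code; the actual change is in the diff below.

/* Minimal model, not kernel code: one "CPU" switching mm vs. a remote
 * checker deciding whether a flush IPI is needed. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct mm { int id; };

/* Sentinel marking the "just about to write CR3" window. */
#define LOADED_MM_SWITCHING ((struct mm *)1)

static _Atomic(struct mm *) loaded_mm;  /* models cpu_tlbstate.loaded_mm */
static atomic_bool is_lazy;             /* models cpu_tlbstate_shared.is_lazy */

/* Writer side: models the relevant steps of switch_mm_irqs_off(). */
static void model_switch_mm(struct mm *next)
{
        /* Mark the fuzzy window before CR3 is written. */
        atomic_store_explicit(&loaded_mm, LOADED_MM_SWITCHING, memory_order_release);
        atomic_store_explicit(&is_lazy, false, memory_order_release);

        /* ... CR3 would be written here; from this point on, flush IPIs
         *     targeted at this CPU must not be ignored ... */

        atomic_store_explicit(&loaded_mm, next, memory_order_release);
}

/* Reader side: models should_flush_tlb() for a flush targeting 'target'. */
static bool model_should_flush_tlb(struct mm *target)
{
        /* The acquire loads play the role of the patch's smp_rmb():
         * 'is_lazy' is observed at least as new as 'loaded_mm'. */
        struct mm *mm = atomic_load_explicit(&loaded_mm, memory_order_acquire);

        if (atomic_load_explicit(&is_lazy, memory_order_acquire))
                return false;   /* lazy TLB: flushed at next context switch */

        if (mm == LOADED_MM_SWITCHING)
                return true;    /* could be prev or next: assume the worst */

        return mm == target;    /* the target mm is loaded */
}

int main(void)
{
        struct mm next = { .id = 1 };

        model_switch_mm(&next);
        printf("flush needed: %d\n", model_should_flush_tlb(&next));
        return 0;
}

Compiled with -std=c11, this prints "flush needed: 1" once the switch completes; the key point of the fix is that the LOADED_MM_SWITCHING check also returns true while the switch is still in flight.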
1 parent 9c69f88 commit fea4e31

1 file changed (+19, -3 lines changed)

arch/x86/mm/tlb.c

Lines changed: 19 additions & 3 deletions
@@ -899,8 +899,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
                 cond_mitigation(tsk);
 
                 /*
-                 * Let nmi_uaccess_okay() and finish_asid_transition()
-                 * know that CR3 is changing.
+                 * Indicate that CR3 is about to change. nmi_uaccess_okay()
+                 * and others are sensitive to the window where mm_cpumask(),
+                 * CR3 and cpu_tlbstate.loaded_mm are not all in sync.
                 */
                 this_cpu_write(cpu_tlbstate.loaded_mm, LOADED_MM_SWITCHING);
                 barrier();
@@ -1204,8 +1205,16 @@ static void flush_tlb_func(void *info)
 
 static bool should_flush_tlb(int cpu, void *data)
 {
+        struct mm_struct *loaded_mm = per_cpu(cpu_tlbstate.loaded_mm, cpu);
         struct flush_tlb_info *info = data;
 
+        /*
+         * Order the 'loaded_mm' and 'is_lazy' against their
+         * write ordering in switch_mm_irqs_off(). Ensure
+         * 'is_lazy' is at least as new as 'loaded_mm'.
+         */
+        smp_rmb();
+
         /* Lazy TLB will get flushed at the next context switch. */
         if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
                 return false;
@@ -1214,8 +1223,15 @@ static bool should_flush_tlb(int cpu, void *data)
         if (!info->mm)
                 return true;
 
+        /*
+         * While switching, the remote CPU could have state from
+         * either the prev or next mm. Assume the worst and flush.
+         */
+        if (loaded_mm == LOADED_MM_SWITCHING)
+                return true;
+
         /* The target mm is loaded, and the CPU is not lazy. */
-        if (per_cpu(cpu_tlbstate.loaded_mm, cpu) == info->mm)
+        if (loaded_mm == info->mm)
                 return true;
 
         /* In cpumask, but not the loaded mm? Periodically remove by flushing. */
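For reference, this is how should_flush_tlb() reads with the hunks above applied. It is assembled only from the diff context shown here; lines that fall outside the hunks are elided with placeholder comments rather than guessed at.

static bool should_flush_tlb(int cpu, void *data)
{
        struct mm_struct *loaded_mm = per_cpu(cpu_tlbstate.loaded_mm, cpu);
        struct flush_tlb_info *info = data;

        /*
         * Order the 'loaded_mm' and 'is_lazy' against their
         * write ordering in switch_mm_irqs_off(). Ensure
         * 'is_lazy' is at least as new as 'loaded_mm'.
         */
        smp_rmb();

        /* Lazy TLB will get flushed at the next context switch. */
        if (per_cpu(cpu_tlbstate_shared.is_lazy, cpu))
                return false;

        /* ... lines between the second and third hunks are not shown ... */

        if (!info->mm)
                return true;

        /*
         * While switching, the remote CPU could have state from
         * either the prev or next mm. Assume the worst and flush.
         */
        if (loaded_mm == LOADED_MM_SWITCHING)
                return true;

        /* The target mm is loaded, and the CPU is not lazy. */
        if (loaded_mm == info->mm)
                return true;

        /* In cpumask, but not the loaded mm? Periodically remove by flushing. */
        /* ... remainder of the function is outside the diff ... */
}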

0 commit comments
