
Commit 31158ad

kkdwivedi authored and Alexei Starovoitov committed
rqspinlock: Add deadlock detection and recovery
While the timeout logic provides guarantees for the waiter's forward progress, the time until a stalling waiter unblocks can still be long. The default timeout of 1/4 sec can be excessively long for some use cases. Additionally, custom timeouts may exacerbate recovery time.

Introduce logic to detect common cases of deadlocks and perform quicker recovery. This is done by dividing the time from entry into the locking slow path until the timeout into intervals of 1 ms. Then, after each interval elapses, deadlock detection is performed, while also polling the lock word to ensure we can quickly break out of the detection logic and proceed with lock acquisition.

A 'held_locks' table is maintained per-CPU where the entry at the bottom denotes a lock being waited for or already taken. Entries coming before it denote locks that are already held. The current CPU's table can thus be looked at to detect AA deadlocks. The tables from other CPUs can be looked at to discover ABBA situations. Finally, when a matching entry for the lock being taken on the current CPU is found on some other CPU, a deadlock situation is detected. This function can take a long time, therefore the lock word is constantly polled in each loop iteration to ensure we can preempt detection and proceed with lock acquisition, using the is_lock_released check.

We set the 'spin' member of the rqspinlock_timeout struct to 0 to trigger deadlock checks immediately and perform faster recovery.

Note: Extending the lock word size by 4 bytes to record the owner CPU can allow faster detection for ABBA, since it is typically the owner which participates in an ABBA situation. However, to keep compatibility with existing lock words in the kernel (struct qspinlock), and given that deadlocks are a rare event triggered by bugs, we choose to favor compatibility over faster detection.

The release_held_lock_entry function requires an smp_wmb, while the release store on unlock provides the necessary ordering for us. Add comments to document the subtleties of why this is correct. It is possible for stores to still be reordered, but in the context of the deadlock detection algorithm, a release barrier is sufficient and needn't be stronger for unlock's case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250316040541.108729-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
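For readers following along, the sketch below illustrates the kind of table scan the message describes: the current CPU's held-locks table catches AA, remote tables catch ABBA. It is a minimal sketch, not the detection code added by this commit (the .c side of the change is not shown in this page's hunk); only struct rqspinlock_held, RES_NR_HELD and rqspinlock_held_locks come from the header diff further down, while the function names and exact matching rules are assumptions made for illustration.

#include <asm-generic/rqspinlock.h>
#include <linux/cpumask.h>
#include <linux/minmax.h>
#include <linux/percpu.h>
#include <linux/smp.h>

/* Sketch only: AA means the lock we are attempting already sits in our own table. */
static bool sketch_check_deadlock_aa(void *lock)
{
	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
	int cnt = min(RES_NR_HELD, rqh->cnt);
	int i;

	/* The bottom entry (cnt - 1) is the attempt itself; earlier entries are held locks. */
	for (i = 0; i < cnt - 1; i++) {
		if (rqh->locks[i] == lock)
			return true;
	}
	return false;
}

/*
 * Sketch only: ABBA means some remote CPU has an entry for the lock we are
 * attempting while the lock it is itself attempting (its bottom entry) is one
 * that we already hold.
 */
static bool sketch_check_deadlock_abba(void *lock)
{
	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
	int rqh_cnt = min(RES_NR_HELD, rqh->cnt);
	int cpu, i, j;

	for_each_possible_cpu(cpu) {
		struct rqspinlock_held *rqh_cpu = per_cpu_ptr(&rqspinlock_held_locks, cpu);
		int cnt = min(RES_NR_HELD, READ_ONCE(rqh_cpu->cnt));

		if (cpu == smp_processor_id() || cnt <= 0)
			continue;

		for (i = 0; i < cnt; i++) {
			if (READ_ONCE(rqh_cpu->locks[i]) != lock)
				continue;
			/* Remote CPU references our lock; is it waiting on one we hold? */
			for (j = 0; j < rqh_cnt - 1; j++) {
				if (rqh->locks[j] == READ_ONCE(rqh_cpu->locks[cnt - 1]))
					return true;
			}
		}
	}
	return false;
}

In the slow path itself, such scans would only run once per roughly 1 ms interval of the overall timeout, with the lock word re-polled between loop iterations (the is_lock_released check the message refers to) so that a released lock aborts detection and lets acquisition proceed immediately.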
1 parent 3bb1593 commit 31158ad

File tree

2 files changed: +273, -14 lines


include/asm-generic/rqspinlock.h

Lines changed: 100 additions & 0 deletions
@@ -11,6 +11,7 @@
 
 #include <linux/types.h>
 #include <vdso/time64.h>
+#include <linux/percpu.h>
 
 struct qspinlock;
 typedef struct qspinlock rqspinlock_t;
@@ -22,4 +23,103 @@ extern int resilient_queued_spin_lock_slowpath(rqspinlock_t *lock, u32 val);
  */
 #define RES_DEF_TIMEOUT (NSEC_PER_SEC / 4)
 
+/*
+ * Choose 31 as it makes rqspinlock_held cacheline-aligned.
+ */
+#define RES_NR_HELD 31
+
+struct rqspinlock_held {
+	int cnt;
+	void *locks[RES_NR_HELD];
+};
+
+DECLARE_PER_CPU_ALIGNED(struct rqspinlock_held, rqspinlock_held_locks);
+
+static __always_inline void grab_held_lock_entry(void *lock)
+{
+	int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);
+
+	if (unlikely(cnt > RES_NR_HELD)) {
+		/* Still keep the inc so we decrement later. */
+		return;
+	}
+
+	/*
+	 * Implied compiler barrier in per-CPU operations; otherwise we can have
+	 * the compiler reorder inc with write to table, allowing interrupts to
+	 * overwrite and erase our write to the table (as on interrupt exit it
+	 * will be reset to NULL).
+	 *
+	 * It is fine for cnt inc to be reordered wrt remote readers though,
+	 * they won't observe our entry until the cnt update is visible, that's
+	 * all.
+	 */
+	this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
+}
+
+/*
+ * We simply don't support out-of-order unlocks, and keep the logic simple here.
+ * The verifier prevents BPF programs from unlocking out-of-order, and the same
+ * holds for in-kernel users.
+ *
+ * It is possible to run into misdetection scenarios of AA deadlocks on the same
+ * CPU, and missed ABBA deadlocks on remote CPUs if this function pops entries
+ * out of order (due to lock A, lock B, unlock A, unlock B) pattern. The correct
+ * logic to preserve right entries in the table would be to walk the array of
+ * held locks and swap and clear out-of-order entries, but that's too
+ * complicated and we don't have a compelling use case for out of order unlocking.
+ */
+static __always_inline void release_held_lock_entry(void)
+{
+	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+
+	if (unlikely(rqh->cnt > RES_NR_HELD))
+		goto dec;
+	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
+dec:
+	/*
+	 * Reordering of clearing above with inc and its write in
+	 * grab_held_lock_entry that came before us (in same acquisition
+	 * attempt) is ok, we either see a valid entry or NULL when it's
+	 * visible.
+	 *
+	 * But this helper is invoked when we unwind upon failing to acquire the
+	 * lock. Unlike the unlock path which constitutes a release store after
+	 * we clear the entry, we need to emit a write barrier here. Otherwise,
+	 * we may have a situation as follows:
+	 *
+	 * <error> for lock B
+	 * release_held_lock_entry
+	 *
+	 * try_cmpxchg_acquire for lock A
+	 * grab_held_lock_entry
+	 *
+	 * Lack of any ordering means reordering may occur such that dec, inc
+	 * are done before entry is overwritten. This permits a remote lock
+	 * holder of lock B (which this CPU failed to acquire) to now observe it
+	 * as being attempted on this CPU, and may lead to misdetection (if this
+	 * CPU holds a lock it is attempting to acquire, leading to false ABBA
+	 * diagnosis).
+	 *
+	 * In case of unlock, we will always do a release on the lock word after
+	 * releasing the entry, ensuring that other CPUs cannot hold the lock
+	 * (and make conclusions about deadlocks) until the entry has been
+	 * cleared on the local CPU, preventing any anomalies. Reordering is
+	 * still possible there, but a remote CPU cannot observe a lock in our
+	 * table which it is already holding, since visibility entails our
+	 * release store for the said lock has not retired.
+	 *
+	 * In theory we don't have a problem if the dec and WRITE_ONCE above get
+	 * reordered with each other, we either notice an empty NULL entry on
+	 * top (if dec succeeds WRITE_ONCE), or a potentially stale entry which
+	 * cannot be observed (if dec precedes WRITE_ONCE).
+	 *
+	 * Emit the write barrier _before_ the dec, this permits dec-inc
+	 * reordering but that is harmless as we'd have new entry set to NULL
+	 * already, i.e. they cannot precede the NULL store above.
+	 */
+	smp_wmb();
+	this_cpu_dec(rqspinlock_held_locks.cnt);
+}
+
 #endif /* __ASM_GENERIC_RQSPINLOCK_H */
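For illustration only, a hedged sketch of how a caller on the acquisition path might pair the two helpers above: res_spin_trylock_sketch() is hypothetical, queued_spin_trylock() is used merely as a stand-in acquire step (rqspinlock_t is a typedef of struct qspinlock), the error code is arbitrary, and IRQ/preemption handling done by the real slow path is omitted.

#include <linux/errno.h>
#include <asm-generic/qspinlock.h>	/* queued_spin_trylock(): stand-in acquire step */
#include <asm-generic/rqspinlock.h>

/* Hypothetical usage sketch, not part of this commit's diff. */
static inline int res_spin_trylock_sketch(rqspinlock_t *lock)
{
	/* Publish the attempt: it becomes the bottom entry of this CPU's table. */
	grab_held_lock_entry(lock);

	if (queued_spin_trylock(lock))
		return 0;	/* Acquired: the entry stays until unlock clears it. */

	/*
	 * Abandoning the attempt (e.g. deadlock detected or timeout hit):
	 * unwind our entry; the smp_wmb() in release_held_lock_entry() orders
	 * the NULL store before later table updates, as documented above.
	 */
	release_held_lock_entry();
	return -EDEADLK;
}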
