-
Notifications
You must be signed in to change notification settings - Fork 42
Description
Background
When a panic occurs while a MutexGuard
is held, the Mutex
is marked as poisoned and the lock is released. Future attempts to acquire the lock will fail with PoisonError
, which can be overridden by calling PoisonError::into_inner
to still get at the inner value. See the standard library documentation for Mutex poisoning for details.
Shuttle's version of Mutex
also supports this behavior, which is validated by the existing unit test (pasted below), which panics while holding a lock, catches the panic to avoid failing the test, then asserts that attempts to re-acquire the lock after the panic fail with PoisonError
:
#[test]
fn mutex_poison() {
check_dfs(
|| {
let lock = Arc::new(Mutex::new(false));
let thd = {
let lock = Arc::clone(&lock);
thread::spawn(move || {
let _err = catch_unwind(AssertUnwindSafe(move || {
let _lock = lock.lock().unwrap();
panic!("expected panic");
}))
.unwrap_err();
})
};
thd.join().unwrap();
let result = lock.lock();
assert!(matches!(result, Err(PoisonError { .. })));
},
None,
)
}
Issue
However, if we modify the test to use Mutex::try_lock
instead of Mutex::lock
(as below), the test no longer succeeds: we expect try_lock
to fail with a TryLockError::Poisoned
error, but we get a TryLockError::WouldBlock
error instead, even though there is only one thread and the lock is not held:
#[test]
fn mutex_poison_try_lock() {
check_dfs(
|| {
let lock = Arc::new(Mutex::new(false));
let thd = {
let lock = Arc::clone(&lock);
thread::spawn(move || {
let _err = catch_unwind(AssertUnwindSafe(move || {
let _lock = lock.lock().unwrap();
panic!("expected panic");
}))
.unwrap_err();
})
};
thd.join().unwrap();
match lock.try_lock() {
Ok(_) => panic!("try_lock succeeded"),
Err(TryLockError::WouldBlock) => panic!("try_lock failed with WouldBlock"),
Err(TryLockError::Poisoned(_)) => (), // Expected case
};
},
None,
)
}
This fails with
thread 'basic::panic::mutex_poison_try_lock' panicked at shuttle/tests/basic/panic.rs:56:50:
try_lock failed with WouldBlock
Cause
The root cause appears to come from the BatchSemaphore
that provides the basic concurrency primitive Shuttle uses to implement Mutex
. In BatchSemaphore::release
, the semaphore is marked as "closed" if ExecutionState::should_stop()
is true (which is the case if std::thread::panicking()
is true):
pub fn release(&self, num_permits: usize) {
...
if ExecutionState::should_stop() {
...
state.closed = true;
return;
}
...
}
Then, because the semaphore is marked as closed, all calls to BatchSemaphore::try_acquire
return Err(TryAcquireError::Closed)
, which Mutex::try_lock
maps to a WouldBlock
error: self.semaphore.try_acquire(1).map_err(|_| TryLockError::WouldBlock)?;
.
Possible direction for a fix
It seems like the root cause of the issue is the assumption that ExecutionState::should_stop()
should return true if the thread is panicking. This is true if the code under test doesn't have any std::panic::catch_unwind
s, because the panic is going to propagate up and cause a test failure anyways. But if there are caught panics in the test, then it's not necessarily true that the thread panicking means the test case is going to abort, and the semaphore should still be able to vend permits in this case.
However, the implementation of should_stop
mentions that this behavior is required in order to avoid calling into the scheduler during a panic:
/// We also stop if we are currently panicking (e.g., perhaps we're unwinding the stack for a
/// panic triggered while someone held a Mutex, and so are executing the Drop handler for
/// MutexGuard). This avoids calling back into the scheduler during a panic, because the state
/// may be poisoned or otherwise invalid.
I'm not sure whether the implications of removing this assumption. What could happen if we remove this check, and possibly call into the scheduler while panicking? Is there a way we could signal things differently to tell the difference between a panic coming from user code vs from Shuttle itself?