
Mutex::try_lock on poisoned mutex always fails with WouldBlock #190

Background

When a panic occurs while a MutexGuard is held, the Mutex is marked as poisoned and the lock is released. Future attempts to acquire the lock will fail with PoisonError, which can be overridden by calling PoisonError::into_inner to still get at the inner value. See the standard library documentation for Mutex poisoning for details.
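
For reference, here is a minimal std-only sketch (not Shuttle code) of that behavior: the panic while the guard is held poisons the mutex, later lock calls return PoisonError, and into_inner still recovers the value:

use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex};

fn main() {
    let lock = Arc::new(Mutex::new(0));

    // Panic while the guard is held; dropping the guard during unwinding
    // marks the mutex as poisoned.
    let _ = catch_unwind(AssertUnwindSafe(|| {
        let _guard = lock.lock().unwrap();
        panic!("poison the mutex");
    }));

    // Subsequent lock attempts fail with PoisonError, but the inner value
    // is still reachable via into_inner.
    let value = match lock.lock() {
        Ok(guard) => *guard,
        Err(poisoned) => *poisoned.into_inner(),
    };
    assert_eq!(value, 0);
}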

Shuttle's version of Mutex also supports this behavior, as validated by the existing unit test (pasted below): it panics while holding the lock, catches the panic so the test itself doesn't fail, and then asserts that attempts to re-acquire the lock after the panic fail with PoisonError:

#[test]
fn mutex_poison() {
    check_dfs(
        || {
            let lock = Arc::new(Mutex::new(false));

            let thd = {
                let lock = Arc::clone(&lock);
                thread::spawn(move || {
                    let _err = catch_unwind(AssertUnwindSafe(move || {
                        let _lock = lock.lock().unwrap();
                        panic!("expected panic");
                    }))
                    .unwrap_err();
                })
            };

            thd.join().unwrap();

            let result = lock.lock();
            assert!(matches!(result, Err(PoisonError { .. })));
        },
        None,
    )
}

Issue

However, if we modify the test to use Mutex::try_lock instead of Mutex::lock (as below), the test no longer succeeds: we expect try_lock to fail with a TryLockError::Poisoned error, but we get a TryLockError::WouldBlock error instead, even though there is only one thread and the lock is not held:

#[test]
fn mutex_poison_try_lock() {
    check_dfs(
        || {
            let lock = Arc::new(Mutex::new(false));

            let thd = {
                let lock = Arc::clone(&lock);
                thread::spawn(move || {
                    let _err = catch_unwind(AssertUnwindSafe(move || {
                        let _lock = lock.lock().unwrap();
                        panic!("expected panic");
                    }))
                    .unwrap_err();
                })
            };

            thd.join().unwrap();

            match lock.try_lock() {
                Ok(_) => panic!("try_lock succeeded"),
                Err(TryLockError::WouldBlock) => panic!("try_lock failed with WouldBlock"),
                Err(TryLockError::Poisoned(_)) => (), // Expected case
            };
        },
        None,
    )
}

This fails with

thread 'basic::panic::mutex_poison_try_lock' panicked at shuttle/tests/basic/panic.rs:56:50:
try_lock failed with WouldBlock
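
For comparison, std's Mutex behaves the way the test expects; a minimal std-only sketch (not Shuttle code) shows try_lock on a poisoned, uncontended mutex reporting Poisoned rather than WouldBlock:

use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex, TryLockError};

fn main() {
    let lock = Arc::new(Mutex::new(false));

    // Poison the mutex by panicking while the guard is held.
    let _ = catch_unwind(AssertUnwindSafe(|| {
        let _guard = lock.lock().unwrap();
        panic!("expected panic");
    }));

    // No other thread holds the lock, so std reports Poisoned, not WouldBlock.
    assert!(matches!(lock.try_lock(), Err(TryLockError::Poisoned(_))));
}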

Cause

The root cause appears to lie in the BatchSemaphore, the basic concurrency primitive Shuttle uses to implement Mutex. In BatchSemaphore::release, the semaphore is marked as "closed" if ExecutionState::should_stop() is true (which is the case if std::thread::panicking() is true):

pub fn release(&self, num_permits: usize) {
    ...
    if ExecutionState::should_stop() {
        ...
        state.closed = true;
        return;
    }
    ...
}

Then, because the semaphore is marked as closed, all calls to BatchSemaphore::try_acquire return Err(TryAcquireError::Closed), which Mutex::try_lock maps to a WouldBlock error: self.semaphore.try_acquire(1).map_err(|_| TryLockError::WouldBlock)?;.
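
To make the collapse concrete, here is a self-contained schematic (the enums and the closed flag are simplified stand-ins, not Shuttle's actual types) of the path described above:

// Simplified stand-ins for Shuttle's types, reduced to the relevant parts.
#[allow(dead_code)]
#[derive(Debug)]
enum TryAcquireError { WouldBlock, Closed }

#[allow(dead_code)]
#[derive(Debug)]
enum TryLockError { WouldBlock, Poisoned }

struct BatchSemaphore { closed: bool }

impl BatchSemaphore {
    fn try_acquire(&self, _num_permits: usize) -> Result<(), TryAcquireError> {
        // After the panic, release() has set `closed`, so every acquire fails
        // with Closed even though permits are available.
        if self.closed { Err(TryAcquireError::Closed) } else { Ok(()) }
    }
}

fn try_lock(sem: &BatchSemaphore) -> Result<(), TryLockError> {
    // The map_err discards which error occurred, so Closed surfaces to the
    // caller as WouldBlock rather than Poisoned.
    sem.try_acquire(1).map_err(|_| TryLockError::WouldBlock)?;
    Ok(())
}

fn main() {
    let sem = BatchSemaphore { closed: true };
    assert!(matches!(try_lock(&sem), Err(TryLockError::WouldBlock)));
}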

Possible direction for a fix

It seems like the root cause of the issue is the assumption that ExecutionState::should_stop() should return true if the thread is panicking. That assumption holds if the code under test doesn't contain any std::panic::catch_unwinds, because the panic will propagate up and cause a test failure anyway. But if the test catches panics, a panicking thread doesn't necessarily mean the test case is going to abort, and the semaphore should still be able to vend permits in that case.

However, the implementation of should_stop mentions that this behavior is required in order to avoid calling into the scheduler during a panic:

    /// We also stop if we are currently panicking (e.g., perhaps we're unwinding the stack for a
    /// panic triggered while someone held a Mutex, and so are executing the Drop handler for
    /// MutexGuard). This avoids calling back into the scheduler during a panic, because the state
    /// may be poisoned or otherwise invalid.

I'm not sure what the implications of removing this assumption would be. What could happen if we removed this check and possibly called into the scheduler while panicking? Is there a way we could signal things differently to distinguish a panic coming from user code from one coming from Shuttle itself?
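
Purely as an illustration of that last question (the names and mechanism below are hypothetical, not Shuttle's API), one could imagine Shuttle setting an explicit thread-local flag just before any panic it raises itself, and having should_stop consult that flag instead of std::thread::panicking() alone, so that a user panic caught by catch_unwind would no longer close the semaphore:

use std::cell::Cell;

thread_local! {
    // Hypothetical flag, set only by Shuttle immediately before it raises an
    // internal panic (e.g., a scheduling failure), never by user code.
    static PANIC_RAISED_BY_SHUTTLE: Cell<bool> = Cell::new(false);
}

// Hypothetical replacement for the std::thread::panicking() check inside
// ExecutionState::should_stop(): only stop (and close semaphores) when the
// unwinding was triggered by Shuttle itself.
fn panicking_inside_shuttle() -> bool {
    std::thread::panicking() && PANIC_RAISED_BY_SHUTTLE.with(|f| f.get())
}

fn main() {
    // A user panic caught by catch_unwind never sets the flag, so this is false.
    assert!(!panicking_inside_shuttle());
}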
